# MCP Tooling for ProxySQL RAG Engine (v0 Blueprint)
This document defines the MCP tool surface for querying ProxySQL's embedded RAG index. It is intended as a stable interface for AI agents. Internally, these tools query the SQLite schema described in `schema.sql` and use the retrieval logic described in `architecture-runtime-retrieval.md`.
**Design goals**
- Stable tool contracts (do not break agents when internals change)
- Strict bounds (prevent unbounded scans / large outputs)
- Deterministic schemas (agents can reliably parse outputs)
- Separation of concerns:
  - Retrieval returns identifiers and scores
  - Fetch returns content
  - Optional refetch returns authoritative source rows
---
## 1. Conventions
### 1.1 Identifiers
- `doc_id`: stable document identifier (e.g. `posts:12345`)
- `chunk_id`: stable chunk identifier (e.g. `posts:12345#0`)
- `source_id` / `source_name`: corresponds to `rag_sources`
### 1.2 Scores
- FTS score: `score_fts` (bm25; in SQLite's FTS5, a lower bm25 value is a better match by default)
- Vector score: `score_vec` (distance or similarity, depending on implementation)
- Hybrid score: `score` (normalized fused score; higher is better)
**Recommendation**
Normalize scores in the MCP layer so that:
- higher is always better for agent ranking
- raw internal ranking can still be returned as `score_fts_raw`, `distance_raw`, etc. if helpful
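
A minimal sketch of this normalization, assuming the raw FTS value comes from SQLite FTS5 `bm25()` (negative for matches, lower is better) and the raw vector value is a non-negative distance. The function names and the squashing formulas are illustrative choices, not part of this spec.

```python
def normalize_fts(score_fts_raw: float) -> float:
    """Map a raw bm25 value (lower is better) to [0, 1), higher is better."""
    relevance = max(-score_fts_raw, 0.0)  # flip the sign so bigger means better
    return relevance / (1.0 + relevance)


def normalize_distance(distance_raw: float) -> float:
    """Map a non-negative distance (lower is better) to (0, 1], higher is better."""
    return 1.0 / (1.0 + max(distance_raw, 0.0))


# The raw values can still be returned next to the normalized ones.
result = {
    "score_fts": normalize_fts(-3.0),
    "score_fts_raw": -3.0,
    "score_vec": normalize_distance(0.25),
    "distance_raw": 0.25,
}
```

Any monotonic mapping would do here; what matters for agents is that ordering by the normalized field always means "best first".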
### 1.3 Limits and budgets (recommended defaults)
All tools should enforce caps, regardless of caller input:
- `k_max = 50`
- `candidates_max = 500`
- `query_max_bytes = 8192`
- `response_max_bytes = 5_000_000`
- `timeout_ms` (per tool): 250-2000 ms depending on tool type
Tools must return a `truncated` boolean if limits reduce output.
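
The sketch below shows one way a tool could enforce the caps before doing any work. The helper name `apply_caps`, the `BoundedRequest` type, and the choice to reject (rather than cut) an oversized query are assumptions made for illustration.

```python
from dataclasses import dataclass

K_MAX = 50
QUERY_MAX_BYTES = 8192


@dataclass
class BoundedRequest:
    query: str
    k: int
    truncated: bool  # True when a cap reduced what the caller asked for


def apply_caps(query: str, k: int) -> BoundedRequest:
    """Validate the query size and clamp k, regardless of caller input."""
    if len(query.encode("utf-8")) > QUERY_MAX_BYTES:
        raise ValueError("query exceeds query_max_bytes")
    if k < 1:
        raise ValueError("k must be >= 1")
    capped_k = min(k, K_MAX)
    return BoundedRequest(query=query, k=capped_k, truncated=capped_k < k)


bounded = apply_caps("json_extract mysql", k=200)  # k is clamped to 50
```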
---
## 2. Shared filter model
Many tools accept the same filter structure. This is intentionally simple in v0.
### 2.1 Filter object
```json
{
"source_ids": [1,2],
"source_names": ["stack_posts"],
"doc_ids": ["posts:12345"],
"min_score": 5,
"post_type_ids": [1],
"tags_any": ["mysql","json"],
"tags_all": ["mysql","json"],
"created_after": "2022-01-01T00:00:00Z",
"created_before": "2025-01-01T00:00:00Z"
}
```
**Notes**
- In v0, most filters map to `metadata_json` values. Implementation can:
  - filter in SQLite if JSON functions are available, or
  - filter in MCP layer after initial retrieval (acceptable for small k/candidates)
- For production, denormalize hot filters into dedicated columns for speed.
### 2.2 Filter behavior
- If both `source_ids` and `source_names` are provided, treat them as an intersection.
- If no source filter is provided, default to all enabled sources **but** enforce a strict global budget.
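
The two rules above can be sketched as a small resolver. The function name and the `enabled_sources` mapping (source id to source name, as it might be loaded from `rag_sources`) are hypothetical.

```python
def resolve_source_ids(filters: dict, enabled_sources: dict[int, str]) -> set[int]:
    """Return the set of source ids a search is allowed to touch."""
    allowed = set(enabled_sources)  # default: every enabled source
    if "source_ids" in filters:
        allowed &= set(filters["source_ids"])
    if "source_names" in filters:
        names = set(filters["source_names"])
        allowed &= {sid for sid, name in enabled_sources.items() if name in names}
    return allowed


enabled = {1: "stack_posts", 2: "docs", 3: "wiki"}
both = resolve_source_ids({"source_ids": [1, 2], "source_names": ["stack_posts"]}, enabled)
everything = resolve_source_ids({}, enabled)
```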
---
## 3. Tool: `rag.search_fts`
Keyword search over `rag_fts_chunks`.
### 3.1 Request schema
```json
{
"query": "json_extract mysql",
"k": 10,
"offset": 0,
"filters": { },
"return": {
"include_title": true,
"include_metadata": true,
"include_snippets": false
}
}
```
### 3.2 Semantics
- Executes FTS query (MATCH) over indexed content.
- Returns top-k chunk matches with scores and identifiers.
- Does not return full chunk bodies unless `include_snippets` is requested (still bounded).
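
As a rough illustration of these semantics, the following sketch runs a bounded `MATCH` query with Python's `sqlite3` module. The column layout of `rag_fts_chunks` is an assumption (the real one lives in `schema.sql`), and the example needs an SQLite build with FTS5 enabled.

```python
import sqlite3


def search_fts(conn: sqlite3.Connection, query: str, k: int, offset: int = 0) -> list[dict]:
    """Return the top-k FTS matches as identifiers plus the raw bm25 value."""
    rows = conn.execute(
        """
        SELECT chunk_id, bm25(rag_fts_chunks)
        FROM rag_fts_chunks
        WHERE rag_fts_chunks MATCH ?
        ORDER BY bm25(rag_fts_chunks)  -- lower bm25 is better, so ascending
        LIMIT ? OFFSET ?
        """,
        (query, k, offset),
    ).fetchall()
    return [{"chunk_id": cid, "score_fts_raw": raw} for cid, raw in rows]


# Hypothetical in-memory table standing in for the real schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE rag_fts_chunks USING fts5(chunk_id UNINDEXED, title, body)")
conn.executemany(
    "INSERT INTO rag_fts_chunks VALUES (?, ?, ?)",
    [
        ("posts:12345#0", "How to parse JSON in MySQL 8?", "I tried JSON_EXTRACT"),
        ("posts:777#0", "Tuning PostgreSQL vacuum", "Autovacuum settings"),
        ("posts:888#0", "Redis eviction policies", "LRU and LFU"),
    ],
)
hits = search_fts(conn, "mysql json", k=10)
```

Only identifiers and scores come back; bodies are left to `rag.get_chunks`.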
### 3.3 Response schema
```json
{
"results": [
{
"chunk_id": "posts:12345#0",
"doc_id": "posts:12345",
"source_id": 1,
"source_name": "stack_posts",
"score_fts": 0.73,
"title": "How to parse JSON in MySQL 8?",
"metadata": { "Tags": "<mysql><json>", "Score": "12" }
}
],
"truncated": false,
"stats": {
"k_requested": 10,
"k_returned": 10,
"ms": 12
}
}
```
---
## 4. Tool: `rag.search_vector`
Semantic search over `rag_vec_chunks`.
### 4.1 Request schema (text input)
```json
{
"query_text": "How do I extract JSON fields in MySQL?",
"k": 10,
"filters": { },
"embedding": {
"model": "text-embedding-3-large"
}
}
```
### 4.2 Request schema (precomputed vector)
`values_b64` carries the embedding as a packed float32 array, base64-encoded.
```json
{
"query_embedding": {
"dim": 1536,
"values_b64": "AAAA..."
},
"k": 10,
"filters": { }
}
```
### 4.3 Semantics
- If `query_text` is provided, ProxySQL computes embedding internally (preferred for agents).
- If `query_embedding` is provided, ProxySQL uses it directly (useful for advanced clients).
- Returns nearest chunks by distance/similarity.
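
To make the precomputed-vector path concrete, here is a small sketch that decodes a base64 float32 payload and ranks stored vectors by cosine similarity. Little-endian byte order is an assumption (the text above does not fix it), and the brute-force scan over an in-memory dict is only a stand-in for whatever index backs `rag_vec_chunks`.

```python
import base64
import math
import struct


def decode_embedding(dim: int, values_b64: str) -> list[float]:
    """Unpack a base64 string of little-endian float32 values."""
    raw = base64.b64decode(values_b64)
    if len(raw) != 4 * dim:
        raise ValueError("dim does not match payload size")
    return list(struct.unpack(f"<{dim}f", raw))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_chunks(query_vec: list[float], stored: dict[str, list[float]], k: int):
    """Return the k (chunk_id, similarity) pairs with the highest similarity."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


payload = base64.b64encode(struct.pack("<2f", 1.0, 0.0)).decode("ascii")
query_vec = decode_embedding(2, payload)
top = nearest_chunks(query_vec, {"posts:9876#1": [1.0, 0.0], "posts:1#0": [0.0, 1.0]}, k=1)
```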
### 4.4 Response schema
```json
{
"results": [
{
"chunk_id": "posts:9876#1",
"doc_id": "posts:9876",
"source_id": 1,
"source_name": "stack_posts",
"score_vec": 0.82,
"title": "Query JSON columns efficiently",
"metadata": { "Tags": "<mysql><json>", "Score": "8" }
}
],
"truncated": false,
"stats": {
"k_requested": 10,
"k_returned": 10,
"ms": 18
}
}
```
---
## 5. Tool: `rag.search_hybrid`
Hybrid search combining FTS and vectors. Supports two modes:
- **Mode A**: parallel FTS + vector, fuse results (RRF recommended)
- **Mode B**: broad FTS candidate generation, then vector rerank
### 5.1 Request schema (Mode A: fuse)
```json
{
"query": "json_extract mysql",
"k": 10,
"filters": { },
"mode": "fuse",
"fuse": {
"fts_k": 50,
"vec_k": 50,
"rrf_k0": 60,
"w_fts": 1.0,
"w_vec": 1.0
}
}
```
### 5.2 Request schema (Mode B: candidates + rerank)
```json
{
"query": "json_extract mysql",
"k": 10,
"filters": { },
"mode": "fts_then_vec",
"fts_then_vec": {
"candidates_k": 200,
"rerank_k": 50,
"vec_metric": "cosine"
}
}
```
### 5.3 Semantics (Mode A)
1. Run FTS top `fts_k`
2. Run vector top `vec_k`
3. Merge candidates by `chunk_id`
4. Compute fused score (RRF recommended)
5. Return top `k`
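
Steps 3 to 5 can be sketched with reciprocal rank fusion, where each chunk scores the sum of `1 / (rrf_k0 + rank)` over the lists it appears in (ranks start at 1). Applying `w_fts` and `w_vec` as per-list multipliers is an assumption about how the weights in the request are meant to be used.

```python
def rrf_fuse(fts_ids, vec_ids, k, rrf_k0=60, w_fts=1.0, w_vec=1.0):
    """Merge two ranked chunk_id lists and return the top-k (chunk_id, score) pairs."""
    scores: dict[str, float] = {}
    for weight, ranked in ((w_fts, fts_ids), (w_vec, vec_ids)):
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + weight / (rrf_k0 + rank)
    # Sort by score descending; break ties by chunk_id for deterministic output.
    ordered = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return ordered[:k]


fused = rrf_fuse(["a", "b", "c"], ["b", "d", "a"], k=3)
```

In this example `b` ranks first because it sits near the top of both lists. Raw RRF values are small (on the order of `2 / rrf_k0`), so a separate rescaling step would be needed if `score` should land in a 0 to 1 range.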
### 5.4 Semantics (Mode B)
1. Run FTS top `candidates_k`
2. Compute vector similarity within those candidates
   - either by joining candidate chunk_ids to stored vectors, or
   - by embedding candidate chunk text on the fly (not recommended)
3. Return top `k` reranked results
4. Optionally return debug info about candidate stages
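
A minimal sketch of the rerank stage, assuming the FTS stage has already produced `candidate_ids` and that stored vectors can be looked up by `chunk_id` (here a plain dict stands in for that join). The function name and the decision to skip candidates without a stored vector are illustrative.

```python
import math


def rerank_candidates(query_vec, candidate_ids, stored_vectors, k):
    """Rerank FTS candidates by cosine similarity against stored vectors."""

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    scored = []
    for chunk_id in candidate_ids:
        vec = stored_vectors.get(chunk_id)
        if vec is None:
            continue  # candidate has no stored vector; it cannot be reranked
        scored.append((chunk_id, cosine(query_vec, vec)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


reranked = rerank_candidates(
    [1.0, 0.0],
    ["a", "b", "c"],                      # order produced by the FTS stage
    {"a": [0.0, 1.0], "b": [1.0, 0.0]},   # "c" has no stored vector
    k=2,
)
```

Because only the candidate set is scored, the vector work stays bounded by `candidates_k` regardless of index size.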
### 5.5 Response schema
```json
{
"results": [
{
"chunk_id": "posts:12345#0",
"doc_id": "posts:12345",
"source_id": 1,
"source_name": "stack_posts",
"score": 0.91,
"score_fts": 0.74,
"score_vec": 0.86,
"title": "How to parse JSON in MySQL 8?",
"metadata": { "Tags": "<mysql><json>", "Score": "12" },
"debug": {
"rank_fts": 3,
"rank_vec": 6
}
}
],
"truncated": false,
"stats": {
"mode": "fuse",
"k_requested": 10,
"k_returned": 10,
"ms": 27
}
}
```
---
## 6. Tool: `rag.get_chunks`
Fetch chunk bodies by chunk_id. This is how agents obtain grounding text.
### 6.1 Request schema
```json
{
"chunk_ids": ["posts:12345#0", "posts:9876#1"],
"return": {
"include_title": true,
"include_doc_metadata": true,
"include_chunk_metadata": true
}
}
```
### 6.2 Response schema
```json
{
"chunks": [
{
"chunk_id": "posts:12345#0",
"doc_id": "posts:12345",
"title": "How to parse JSON in MySQL 8?",
"body": "<p>I tried JSON_EXTRACT...</p>",
"doc_metadata": { "Tags": "<mysql><json>", "Score": "12" },
"chunk_metadata": { "chunk_index": 0 }
}
],
"truncated": false,
"stats": { "ms": 6 }
}
```
**Hard limit recommendation**
- Cap total returned chunk bytes to a safe maximum (e.g. 1-2 MB).
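
One possible way to apply such a cap is to stop adding chunks once a byte budget is used up and report that through `truncated`. The helper below is a sketch; its name and the choice to count only UTF-8 `body` bytes are assumptions.

```python
def cap_chunks(chunks: list[dict], max_bytes: int) -> tuple[list[dict], bool]:
    """Keep chunks in order until the body byte budget would be exceeded."""
    kept, used = [], 0
    for chunk in chunks:
        size = len(chunk["body"].encode("utf-8"))
        if used + size > max_bytes:
            return kept, True  # budget hit: signal truncation to the caller
        kept.append(chunk)
        used += size
    return kept, False


sample = [{"body": "aaaa"}, {"body": "bbbb"}, {"body": "cc"}]
kept, truncated = cap_chunks(sample, max_bytes=8)
```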
---
## 7. Tool: `rag.get_docs`
Fetch full canonical documents by doc_id (not chunks). Useful for inspection or compact docs.
### 7.1 Request schema
```json
{
"doc_ids": ["posts:12345"],
"return": {
"include_body": true,
"include_metadata": true
}
}
```
### 7.2 Response schema
```json
{
"docs": [
{
"doc_id": "posts:12345",
"source_id": 1,
"source_name": "stack_posts",
"pk_json": { "Id": 12345 },
"title": "How to parse JSON in MySQL 8?",
"body": "<p>...</p>",
"metadata": { "Tags": "<mysql><json>", "Score": "12" }
}
],
"truncated": false,
"stats": { "ms": 7 }
}
```
---
## 8. Tool: `rag.fetch_from_source`
Refetch authoritative rows from the source DB using `doc_id` (via pk_json).
### 8.1 Request schema
```json
{
"doc_ids": ["posts:12345"],
"columns": ["Id","Title","Body","Tags","Score"],
"limits": {
"max_rows": 10,
"max_bytes": 200000
}
}
```
### 8.2 Semantics
- Look up doc(s) in `rag_documents` to get `source_id` and `pk_json`
- Resolve source connection from `rag_sources`
- Execute a parameterized query by primary key
- Return requested columns only
- Enforce strict limits
### 8.3 Response schema
```json
{
"rows": [
{
"doc_id": "posts:12345",
"source_name": "stack_posts",
"row": {
"Id": 12345,
"Title": "How to parse JSON in MySQL 8?",
"Score": 12
}
}
],
"truncated": false,
"stats": { "ms": 22 }
}
```
**Security note**
- This tool must not allow arbitrary SQL.
- Only allow fetching by primary key and a whitelist of columns.
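
The sketch below illustrates the allowlist idea: identifiers are only ever taken from a fixed set, and the key value is bound as a parameter. It uses `sqlite3` as a stand-in for the real source database, and `ALLOWED_COLUMNS`, `fetch_by_pk`, and the `posts` table are hypothetical names.

```python
import sqlite3

# Hypothetical allowlist: table name -> columns that may be read.
ALLOWED_COLUMNS = {"posts": {"Id", "Title", "Body", "Tags", "Score"}}


def fetch_by_pk(conn, table: str, pk_json: dict, columns: list[str]):
    """Fetch one row by primary key, returning only allowlisted columns."""
    allowed = ALLOWED_COLUMNS.get(table)
    if allowed is None:
        raise ValueError(f"table not allowed: {table}")
    if not columns or len(pk_json) != 1:
        raise ValueError("need at least one column and a single-column key")
    ((pk_column, pk_value),) = pk_json.items()
    for name in [pk_column, *columns]:
        if name not in allowed:
            raise ValueError(f"column not allowed: {name}")
    # Identifiers were checked against the allowlist; the value is a bound parameter.
    select_list = ", ".join(f'"{c}"' for c in columns)
    sql = f'SELECT {select_list} FROM "{table}" WHERE "{pk_column}" = ?'
    row = conn.execute(sql, (pk_value,)).fetchone()
    return dict(zip(columns, row)) if row is not None else None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (Id INTEGER PRIMARY KEY, Title TEXT, Body TEXT, Tags TEXT, Score INTEGER)")
conn.execute("INSERT INTO posts VALUES (12345, 'How to parse JSON in MySQL 8?', '<p>...</p>', '<mysql><json>', 12)")
row = fetch_by_pk(conn, "posts", {"Id": 12345}, ["Id", "Title", "Score"])

try:
    fetch_by_pk(conn, "posts", {"Id": 12345}, ["Id; DROP TABLE posts"])
    rejected = False
except ValueError:
    rejected = True  # anything outside the allowlist never reaches the SQL text
```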
---
## 9. Tool: `rag.admin.stats` (recommended)
Operational visibility for dashboards and debugging.
### 9.1 Request
```json
{}
```
### 9.2 Response
```json
{
"sources": [
{
"source_id": 1,
"source_name": "stack_posts",
"docs": 123456,
"chunks": 456789,
"last_sync": null
}
],
"stats": { "ms": 5 }
}
```
---
## 10. Tool: `rag.admin.sync` (optional in v0; required in v1)
Kicks off ingestion for a source or for all sources. In v0, ingestion may run as a separate process; in the ProxySQL product, this would trigger an internal job.
### 10.1 Request
```json
{
"source_names": ["stack_posts"]
}
```
### 10.2 Response
```json
{
"accepted": true,
"job_id": "sync-2026-01-19T10:00:00Z"
}
```
---
## 11. Implementation notes (what the coding agent should implement)
1. **Input validation and caps** for every tool.
2. **Consistent filtering** across FTS/vector/hybrid.
3. **Stable scoring semantics** (higher-is-better recommended).
4. **Efficient joins**:
   - vector search returns chunk_ids; join to `rag_chunks`/`rag_documents` for metadata.
5. **Hybrid modes**:
   - Mode A (fuse): implement RRF
   - Mode B (fts_then_vec): candidate set then vector rerank
6. **Error model**:
   - return structured errors with codes (e.g. `INVALID_ARGUMENT`, `LIMIT_EXCEEDED`, `INTERNAL`)
7. **Observability**:
   - return `stats.ms` in responses
   - track tool usage counters and latency histograms
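
Items 6 and 7 could be combined in a thin wrapper around every tool handler, sketched below. The `ToolError` class, the `run_tool` helper, and the `{"error": {"code", "message"}}` envelope are assumptions; only the error codes and `stats.ms` come from this document.

```python
import time


class ToolError(Exception):
    """Error raised by a tool handler with a stable, machine-readable code."""

    def __init__(self, code: str, message: str):
        super().__init__(message)
        self.code = code
        self.message = message


def run_tool(handler, request: dict) -> dict:
    """Call a handler, convert failures to structured errors, and attach stats.ms."""
    start = time.perf_counter()
    try:
        body = handler(request)
    except ToolError as exc:
        body = {"error": {"code": exc.code, "message": exc.message}}
    except Exception:
        # Do not leak internal details to the agent.
        body = {"error": {"code": "INTERNAL", "message": "internal error"}}
    body.setdefault("stats", {})["ms"] = int((time.perf_counter() - start) * 1000)
    return body


def bad_k(request: dict) -> dict:
    """Example handler that rejects its input."""
    raise ToolError("INVALID_ARGUMENT", "k must be >= 1")


failed = run_tool(bad_k, {"k": 0})
```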
---
## 12. Summary
These MCP tools define a stable retrieval interface:
- Search: `rag.search_fts`, `rag.search_vector`, `rag.search_hybrid`
- Fetch: `rag.get_chunks`, `rag.get_docs`, `rag.fetch_from_source`
- Admin: `rag.admin.stats`, optionally `rag.admin.sync`