# ProxySQL RAG Engine — Runtime Retrieval Architecture (v0 Blueprint) This document describes how ProxySQL becomes a **RAG retrieval engine** at runtime. The companion document (Data Model & Ingestion) explains how content enters the SQLite index. This document explains how content is **queried**, how results are **returned to agents/applications**, and how **hybrid retrieval** works in practice. It is written as an implementation blueprint for ProxySQL (and its MCP server) and assumes the SQLite schema contains: - `rag_sources` (control plane) - `rag_documents` (canonical docs) - `rag_chunks` (retrieval units) - `rag_fts_chunks` (FTS5) - `rag_vec_chunks` (sqlite3-vec vectors) --- ## 1. The runtime role of ProxySQL in a RAG system ProxySQL becomes a RAG runtime by providing four capabilities in one bounded service: 1. **Retrieval Index Host** - Hosts the SQLite index and search primitives (FTS + vectors). - Offers deterministic query semantics and strict budgets. 2. **Orchestration Layer** - Implements search flows (FTS, vector, hybrid, rerank). - Applies filters, caps, and result shaping. 3. **Stable API Surface (MCP-first)** - LLM agents call MCP tools (not raw SQL). - Tool contracts remain stable even if internal storage changes. 4. **Authoritative Row Refetch Gateway** - After retrieval returns `doc_id` / `pk_json`, ProxySQL can refetch the authoritative row from the source DB on-demand (optional). - This avoids returning stale or partial data when the full row is needed. In production terms, this is not “ProxySQL as a general search engine.” It is a **bounded retrieval service** colocated with database access logic. --- ## 2. High-level query flow (agent-centric) A typical RAG flow has two phases: ### Phase A — Retrieval (fast, bounded, cheap) - Query the index to obtain a small number of relevant chunks (and their parent doc identity). - Output includes `chunk_id`, `doc_id`, `score`, and small metadata. ### Phase B — Fetch (optional, authoritative, bounded) - If the agent needs full context or structured fields, it refetches the authoritative row from the source DB using `pk_json`. - This avoids scanning large tables and avoids shipping huge payloads in Phase A. **Canonical flow** 1. `rag.search_hybrid(query, filters, k)` → returns top chunk ids and scores 2. `rag.get_chunks(chunk_ids)` → returns chunk text for prompt grounding/citations 3. Optional: `rag.fetch_from_source(doc_id)` → returns full row or selected columns --- ## 3. Runtime interfaces: MCP vs SQL ProxySQL should support two “consumption modes”: ### 3.1 MCP tools (preferred for AI agents) - Strict limits and predictable response schemas. - Tools return structured results and avoid SQL injection concerns. - Agents do not need direct DB access. ### 3.2 SQL access (for standard applications / debugging) - Applications may connect to ProxySQL’s SQLite admin interface (or a dedicated port) and issue SQL. - Useful for: - internal dashboards - troubleshooting - non-agent apps that want retrieval but speak SQL **Principle** - MCP is the stable, long-term interface. - SQL is optional and may be restricted to trusted callers. --- ## 4. Retrieval primitives ### 4.1 FTS retrieval (keyword / exact match) FTS5 is used for: - error messages - identifiers and function names - tags and exact terms - “grep-like” queries **Typical output** - `chunk_id`, `score_fts`, optional highlights/snippets **Ranking** - `bm25(rag_fts_chunks)` is the default. It is fast and effective for term queries. ### 4.2 Vector retrieval (semantic similarity) Vector search is used for: - paraphrased questions - semantic similarity (“how to do X” vs “best way to achieve X”) - conceptual matching that is poor with keyword-only search **Typical output** - `chunk_id`, `score_vec` (distance/similarity), plus join metadata **Important** - Vectors are generally computed per chunk. - Filters are applied via `source_id` and joins to `rag_chunks` / `rag_documents`. --- ## 5. Hybrid retrieval patterns (two recommended modes) Hybrid retrieval combines FTS and vector search for better quality than either alone. Two concrete modes should be implemented because they solve different problems. ### Mode 1 — “Best of both” (parallel FTS + vector; fuse results) **Use when** - the query may contain both exact tokens (e.g. error messages) and semantic intent **Flow** 1. Run FTS top-N (e.g. N=50) 2. Run vector top-N (e.g. N=50) 3. Merge results by `chunk_id` 4. Score fusion (recommended): Reciprocal Rank Fusion (RRF) 5. Return top-k (e.g. k=10) **Why RRF** - Robust without score calibration - Works across heterogeneous score ranges (bm25 vs cosine distance) **RRF formula** - For each candidate chunk: - `score = w_fts/(k0 + rank_fts) + w_vec/(k0 + rank_vec)` - Typical: `k0=60`, `w_fts=1.0`, `w_vec=1.0` ### Mode 2 — “Broad FTS then vector refine” (candidate generation + rerank) **Use when** - you want strong precision anchored to exact term matches - you want to avoid vector search over the entire corpus **Flow** 1. Run broad FTS query top-M (e.g. M=200) 2. Fetch chunk texts for those candidates 3. Compute vector similarity of query embedding to candidate embeddings 4. Return top-k This mode behaves like a two-stage retrieval pipeline: - Stage 1: cheap recall (FTS) - Stage 2: precise semantic rerank within candidates --- ## 6. Filters, constraints, and budgets (blast-radius control) A RAG retrieval engine must be bounded. ProxySQL should enforce limits at the MCP layer and ideally also at SQL helper functions. ### 6.1 Hard caps (recommended defaults) - Maximum `k` returned: 50 - Maximum candidates for broad-stage: 200–500 - Maximum query length: e.g. 2–8 KB - Maximum response bytes: e.g. 1–5 MB - Maximum execution time per request: e.g. 50–250 ms for retrieval, 1–2 s for fetch ### 6.2 Filter semantics Filters should be applied consistently across retrieval modes. Common filters: - `source_id` or `source_name` - tag include/exclude (via metadata_json parsing or pre-extracted tag fields later) - post type (question vs answer) - minimum score - time range (creation date / last activity) Implementation note: - v0 stores metadata in JSON; filtering can be implemented in MCP layer or via SQLite JSON functions (if enabled). - For performance, later versions should denormalize key metadata into dedicated columns or side tables. --- ## 7. Result shaping and what the caller receives A retrieval response must be designed for downstream LLM usage: ### 7.1 Retrieval results (Phase A) Return a compact list of “evidence candidates”: - `chunk_id` - `doc_id` - `scores` (fts, vec, fused) - short `title` - minimal metadata (source, tags, timestamp, etc.) Do **not** return full bodies by default; that is what `rag.get_chunks` is for. ### 7.2 Chunk fetch results (Phase A.2) `rag.get_chunks(chunk_ids)` returns: - `chunk_id`, `doc_id` - `title` - `body` (chunk text) - optionally a snippet/highlight for display ### 7.3 Source refetch results (Phase B) `rag.fetch_from_source(doc_id)` returns: - either the full row - or a selected subset of columns (recommended) This is the “authoritative fetch” boundary that prevents stale/partial index usage from being a correctness problem. --- ## 8. SQL examples (runtime extraction) These are not the preferred agent interface, but they are crucial for debugging and for SQL-native apps. ### 8.1 FTS search (top 10) ```sql SELECT f.chunk_id, bm25(rag_fts_chunks) AS score_fts FROM rag_fts_chunks f WHERE rag_fts_chunks MATCH 'json_extract mysql' ORDER BY score_fts LIMIT 10; ``` Join to fetch text: ```sql SELECT f.chunk_id, bm25(rag_fts_chunks) AS score_fts, c.doc_id, c.body FROM rag_fts_chunks f JOIN rag_chunks c ON c.chunk_id = f.chunk_id WHERE rag_fts_chunks MATCH 'json_extract mysql' ORDER BY score_fts LIMIT 10; ``` ### 8.2 Vector search (top 10) Vector syntax depends on how you expose query vectors. A typical pattern is: 1) Bind a query vector into a function / parameter 2) Use `rag_vec_chunks` to return nearest neighbors Example shape (conceptual): ```sql -- Pseudocode: nearest neighbors for :query_embedding SELECT v.chunk_id, v.distance FROM rag_vec_chunks v WHERE v.embedding MATCH :query_embedding ORDER BY v.distance LIMIT 10; ``` In production, ProxySQL MCP will typically compute the query embedding and call SQL internally with a bound parameter. --- ## 9. MCP tools (runtime API surface) This document does not define full schemas (that is in `mcp-tools.md`), but it defines what each tool must do. ### 9.1 Retrieval - `rag.search_fts(query, filters, k)` - `rag.search_vector(query_text | query_embedding, filters, k)` - `rag.search_hybrid(query, mode, filters, k, params)` - Mode 1: parallel + RRF fuse - Mode 2: broad FTS candidates + vector rerank ### 9.2 Fetch - `rag.get_chunks(chunk_ids)` - `rag.get_docs(doc_ids)` - `rag.fetch_from_source(doc_ids | pk_json, columns?, limits?)` **MCP-first principle** - Agents do not see SQLite schema or SQL. - MCP tools remain stable even if you move index storage out of ProxySQL later. --- ## 10. Operational considerations ### 10.1 Dedicated ProxySQL instance Run GenAI retrieval in a dedicated ProxySQL instance to reduce blast radius: - independent CPU/memory budgets - independent configuration and rate limits - independent failure domain ### 10.2 Observability and metrics (minimum) - count of docs/chunks per source - query counts by tool and source - p50/p95 latency for: - FTS - vector - hybrid - refetch - dropped/limited requests (rate limit hit, cap exceeded) - error rate and error categories ### 10.3 Safety controls - strict upper bounds on `k` and candidate sizes - strict timeouts - response size caps - optional allowlists for sources accessible to agents - tenant boundaries via filters (strongly recommended for multi-tenant) --- ## 11. Recommended “v0-to-v1” evolution checklist ### v0 (PoC) - ingestion to docs/chunks - FTS search - vector search (if embedding pipeline available) - simple hybrid search - chunk fetch - manual/limited source refetch ### v1 (product hardening) - incremental sync checkpoints (`rag_sync_state`) - update detection (hashing/versioning) - delete handling - robust hybrid search: - RRF fuse - candidate-generation rerank - stronger filtering semantics (denormalized metadata columns) - quotas, rate limits, per-source budgets - full MCP tool contracts + tests --- ## 12. Summary At runtime, ProxySQL RAG retrieval is implemented as: - **Index query** (FTS/vector/hybrid) returning a small set of chunk IDs - **Chunk fetch** returning the text that the LLM will ground on - Optional **authoritative refetch** from the source DB by primary key - Strict limits and consistent filtering to keep the service bounded