ProxySQL RAG Engine — Runtime Retrieval Architecture (v0 Blueprint)
This document describes how ProxySQL becomes a RAG retrieval engine at runtime. The companion document (Data Model & Ingestion) explains how content enters the SQLite index. This document explains how content is queried, how results are returned to agents/applications, and how hybrid retrieval works in practice.
It is written as an implementation blueprint for ProxySQL (and its MCP server) and assumes the SQLite schema contains:
- rag_sources (control plane)
- rag_documents (canonical docs)
- rag_chunks (retrieval units)
- rag_fts_chunks (FTS5)
- rag_vec_chunks (sqlite3-vec vectors)
1. The runtime role of ProxySQL in a RAG system
ProxySQL becomes a RAG runtime by providing four capabilities in one bounded service:
- Retrieval Index Host
  - Hosts the SQLite index and search primitives (FTS + vectors).
  - Offers deterministic query semantics and strict budgets.
- Orchestration Layer
  - Implements search flows (FTS, vector, hybrid, rerank).
  - Applies filters, caps, and result shaping.
- Stable API Surface (MCP-first)
  - LLM agents call MCP tools (not raw SQL).
  - Tool contracts remain stable even if internal storage changes.
- Authoritative Row Refetch Gateway
  - After retrieval returns doc_id / pk_json, ProxySQL can refetch the authoritative row from the source DB on demand (optional).
  - This avoids returning stale or partial data when the full row is needed.
In production terms, this is not “ProxySQL as a general search engine.” It is a bounded retrieval service colocated with database access logic.
2. High-level query flow (agent-centric)
A typical RAG flow has two phases:
Phase A — Retrieval (fast, bounded, cheap)
- Query the index to obtain a small number of relevant chunks (and their parent doc identity).
- Output includes chunk_id, doc_id, score, and small metadata.
Phase B — Fetch (optional, authoritative, bounded)
- If the agent needs full context or structured fields, it refetches the authoritative row from the source DB using pk_json.
- This avoids scanning large tables and avoids shipping huge payloads in Phase A.
Canonical flow
- rag.search_hybrid(query, filters, k) → returns top chunk ids and scores
- rag.get_chunks(chunk_ids) → returns chunk text for prompt grounding/citations
- Optional: rag.fetch_from_source(doc_id) → returns full row or selected columns
3. Runtime interfaces: MCP vs SQL
ProxySQL should support two “consumption modes”:
3.1 MCP tools (preferred for AI agents)
- Strict limits and predictable response schemas.
- Tools return structured results and avoid SQL injection concerns.
- Agents do not need direct DB access.
3.2 SQL access (for standard applications / debugging)
- Applications may connect to ProxySQL’s SQLite admin interface (or a dedicated port) and issue SQL.
- Useful for:
- internal dashboards
- troubleshooting
- non-agent apps that want retrieval but speak SQL
Principle
- MCP is the stable, long-term interface.
- SQL is optional and may be restricted to trusted callers.
4. Retrieval primitives
4.1 FTS retrieval (keyword / exact match)
FTS5 is used for:
- error messages
- identifiers and function names
- tags and exact terms
- “grep-like” queries
Typical output
chunk_id, score_fts, optional highlights/snippets
Ranking
bm25(rag_fts_chunks) is the default. It is fast and effective for term queries.
4.2 Vector retrieval (semantic similarity)
Vector search is used for:
- paraphrased questions
- semantic similarity (“how to do X” vs “best way to achieve X”)
- conceptual matching that is poor with keyword-only search
Typical output
chunk_id, score_vec (distance/similarity), plus join metadata
Important
- Vectors are generally computed per chunk.
- Filters are applied via source_id and joins to rag_chunks / rag_documents.
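As a conceptual illustration of that join-based filtering (with the same caveat as the vector example in section 8.2: exact sqlite-vec MATCH syntax and column layout depend on the deployed version), a source-filtered nearest-neighbor query could look like the following. Because the filter is applied after the KNN step, the orchestration layer should over-fetch (request more than k neighbors) before filtering:
-- Conceptual: nearest neighbors restricted to one source via joins.
-- Over-fetch (request more than the final k) because the source filter
-- is applied after the nearest-neighbor step.
SELECT
  v.chunk_id,
  v.distance
FROM rag_vec_chunks v
JOIN rag_chunks     c ON c.chunk_id = v.chunk_id
JOIN rag_documents  d ON d.doc_id   = c.doc_id
WHERE v.embedding MATCH :query_embedding
  AND d.source_id = :source_id
ORDER BY v.distance
LIMIT 10;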
5. Hybrid retrieval patterns (two recommended modes)
Hybrid retrieval combines FTS and vector search for better quality than either alone. Two concrete modes should be implemented because they solve different problems.
Mode 1 — “Best of both” (parallel FTS + vector; fuse results)
Use when
- the query may contain both exact tokens (e.g. error messages) and semantic intent
Flow
- Run FTS top-N (e.g. N=50)
- Run vector top-N (e.g. N=50)
- Merge results by chunk_id
- Score fusion (recommended): Reciprocal Rank Fusion (RRF)
- Return top-k (e.g. k=10)
Why RRF
- Robust without score calibration
- Works across heterogeneous score ranges (bm25 vs cosine distance)
RRF formula
- For each candidate chunk: score = w_fts / (k0 + rank_fts) + w_vec / (k0 + rank_vec)
- Typical: k0 = 60, w_fts = 1.0, w_vec = 1.0
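A minimal SQLite sketch of Mode 1, assuming both retrievers expose chunk_id, the vector MATCH follows the conceptual shape in section 8.2, and the SQLite build supports window functions (3.25+):
-- Sketch: run both retrievers, rank each candidate list, fuse with RRF (k0 = 60, weights = 1.0).
WITH fts_hits AS (
  SELECT chunk_id, bm25(rag_fts_chunks) AS s
  FROM rag_fts_chunks
  WHERE rag_fts_chunks MATCH :query_text
  ORDER BY s
  LIMIT 50
),
fts_ranked AS (
  SELECT chunk_id, ROW_NUMBER() OVER (ORDER BY s) AS rank_fts FROM fts_hits
),
vec_hits AS (
  SELECT chunk_id, distance
  FROM rag_vec_chunks
  WHERE embedding MATCH :query_embedding   -- exact KNN syntax depends on the sqlite-vec version
  ORDER BY distance
  LIMIT 50
),
vec_ranked AS (
  SELECT chunk_id, ROW_NUMBER() OVER (ORDER BY distance) AS rank_vec FROM vec_hits
)
SELECT chunk_id, SUM(rrf) AS score_rrf
FROM (
  SELECT chunk_id, 1.0 / (60 + rank_fts) AS rrf FROM fts_ranked   -- w_fts = 1.0
  UNION ALL
  SELECT chunk_id, 1.0 / (60 + rank_vec) AS rrf FROM vec_ranked   -- w_vec = 1.0
) AS fused
GROUP BY chunk_id
ORDER BY score_rrf DESC
LIMIT 10;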
Mode 2 — “Broad FTS then vector refine” (candidate generation + rerank)
Use when
- you want strong precision anchored to exact term matches
- you want to avoid vector search over the entire corpus
Flow
- Run broad FTS query top-M (e.g. M=200)
- Fetch chunk texts for those candidates
- Compute vector similarity of query embedding to candidate embeddings
- Return top-k
This mode behaves like a two-stage retrieval pipeline:
- Stage 1: cheap recall (FTS)
- Stage 2: precise semantic rerank within candidates
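A conceptual SQL shape for the two stages, assuming sqlite-vec's vec_distance_cosine() helper is available and candidate embeddings can be read by chunk_id from rag_vec_chunks; in v0 the rerank step can equally be performed in the MCP layer after fetching candidate embeddings:
-- Stage 1: broad FTS recall (M = 200). Stage 2: semantic rerank of those candidates only.
WITH candidates AS (
  SELECT chunk_id
  FROM rag_fts_chunks
  WHERE rag_fts_chunks MATCH :query_text
  ORDER BY bm25(rag_fts_chunks)
  LIMIT 200
)
SELECT
  cand.chunk_id,
  vec_distance_cosine(v.embedding, :query_embedding) AS distance
FROM candidates cand
JOIN rag_vec_chunks v ON v.chunk_id = cand.chunk_id
ORDER BY distance
LIMIT 10;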
6. Filters, constraints, and budgets (blast-radius control)
A RAG retrieval engine must be bounded. ProxySQL should enforce limits at the MCP layer and ideally also at SQL helper functions.
6.1 Hard caps (recommended defaults)
- Maximum k returned: 50
- Maximum candidates for the broad stage: 200–500
- Maximum query length: e.g. 2–8 KB
- Maximum response bytes: e.g. 1–5 MB
- Maximum execution time per request: e.g. 50–250 ms for retrieval, 1–2 s for fetch
6.2 Filter semantics
Filters should be applied consistently across retrieval modes.
Common filters:
- source_id or source_name
- tag include/exclude (via metadata_json parsing or pre-extracted tag fields later)
- post type (question vs answer)
- minimum score
- time range (creation date / last activity)
Implementation note:
- v0 stores metadata in JSON; filtering can be implemented in the MCP layer or via SQLite JSON functions (if enabled), as sketched below.
- For performance, later versions should denormalize key metadata into dedicated columns or side tables.
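For example, a source- and metadata-filtered FTS query in the v0 JSON form (column placement of source_id and metadata_json follows the companion data-model document; the $.post_type path is illustrative):
-- FTS search restricted to one source and to documents marked as answers in metadata_json.
SELECT
  rag_fts_chunks.chunk_id,
  bm25(rag_fts_chunks) AS score_fts,
  d.doc_id
FROM rag_fts_chunks
JOIN rag_chunks    c ON c.chunk_id = rag_fts_chunks.chunk_id
JOIN rag_documents d ON d.doc_id   = c.doc_id
WHERE rag_fts_chunks MATCH :query_text
  AND d.source_id = :source_id
  AND json_extract(d.metadata_json, '$.post_type') = 'answer'
ORDER BY score_fts
LIMIT 10;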
7. Result shaping and what the caller receives
A retrieval response must be designed for downstream LLM usage:
7.1 Retrieval results (Phase A)
Return a compact list of “evidence candidates”:
- chunk_id
- doc_id
- scores (fts, vec, fused)
- short title
- minimal metadata (source, tags, timestamp, etc.)
Do not return full bodies by default; that is what rag.get_chunks is for.
7.2 Chunk fetch results (Phase A.2)
rag.get_chunks(chunk_ids) returns:
- chunk_id, doc_id
- title
- body (chunk text)
- optionally a snippet/highlight for display
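Internally this is a keyed lookup on rag_chunks; a sketch, assuming title lives on rag_documents and is joined in (adjust if it is stored per chunk):
-- Keyed chunk fetch for a small, bounded list of chunk_ids.
SELECT
  c.chunk_id,
  c.doc_id,
  d.title,
  c.body
FROM rag_chunks c
JOIN rag_documents d ON d.doc_id = c.doc_id
WHERE c.chunk_id IN (:chunk_id_1, :chunk_id_2, :chunk_id_3);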
7.3 Source refetch results (Phase B)
rag.fetch_from_source(doc_id) returns:
- either the full row
- or a selected subset of columns (recommended)
This is the “authoritative fetch” boundary that prevents stale/partial index usage from being a correctness problem.
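The query behind this tool runs against the source database, not the SQLite index. A purely illustrative shape, using a hypothetical posts table and a single-column primary key extracted from the stored pk_json:
-- Hypothetical authoritative refetch; real table and column names come from rag_sources + pk_json.
SELECT id, title, body, score, last_activity_date
FROM posts
WHERE id = :pk_id
LIMIT 1;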
8. SQL examples (runtime extraction)
These are not the preferred agent interface, but they are crucial for debugging and for SQL-native apps.
8.1 FTS search (top 10)
SELECT
f.chunk_id,
bm25(rag_fts_chunks) AS score_fts
FROM rag_fts_chunks f
WHERE rag_fts_chunks MATCH 'json_extract mysql'
ORDER BY score_fts
LIMIT 10;
Join to fetch text:
SELECT
f.chunk_id,
bm25(rag_fts_chunks) AS score_fts,
c.doc_id,
c.body
FROM rag_fts_chunks f
JOIN rag_chunks c ON c.chunk_id = f.chunk_id
WHERE rag_fts_chunks MATCH 'json_extract mysql'
ORDER BY score_fts
LIMIT 10;
8.2 Vector search (top 10)
Vector syntax depends on how you expose query vectors. A typical pattern is:
- Bind a query vector into a function / parameter
- Use rag_vec_chunks to return nearest neighbors
Example shape (conceptual):
-- Pseudocode: nearest neighbors for :query_embedding
SELECT
v.chunk_id,
v.distance
FROM rag_vec_chunks v
WHERE v.embedding MATCH :query_embedding
ORDER BY v.distance
LIMIT 10;
In production, ProxySQL MCP will typically compute the query embedding and call SQL internally with a bound parameter.
9. MCP tools (runtime API surface)
This document does not define full schemas (that is in mcp-tools.md), but it defines what each tool must do.
9.1 Retrieval
- rag.search_fts(query, filters, k)
- rag.search_vector(query_text | query_embedding, filters, k)
- rag.search_hybrid(query, mode, filters, k, params)
  - Mode 1: parallel + RRF fuse
  - Mode 2: broad FTS candidates + vector rerank
9.2 Fetch
- rag.get_chunks(chunk_ids)
- rag.get_docs(doc_ids)
- rag.fetch_from_source(doc_ids | pk_json, columns?, limits?)
MCP-first principle
- Agents do not see SQLite schema or SQL.
- MCP tools remain stable even if you move index storage out of ProxySQL later.
10. Operational considerations
10.1 Dedicated ProxySQL instance
Run GenAI retrieval in a dedicated ProxySQL instance to reduce blast radius:
- independent CPU/memory budgets
- independent configuration and rate limits
- independent failure domain
10.2 Observability and metrics (minimum)
- count of docs/chunks per source
- query counts by tool and source
- p50/p95 latency for:
- FTS
- vector
- hybrid
- refetch
- dropped/limited requests (rate limit hit, cap exceeded)
- error rate and error categories
10.3 Safety controls
- strict upper bounds on k and candidate sizes
- strict timeouts
- response size caps
- optional allowlists for sources accessible to agents
- tenant boundaries via filters (strongly recommended for multi-tenant)
11. Recommended “v0-to-v1” evolution checklist
v0 (PoC)
- ingestion to docs/chunks
- FTS search
- vector search (if embedding pipeline available)
- simple hybrid search
- chunk fetch
- manual/limited source refetch
v1 (product hardening)
- incremental sync checkpoints (rag_sync_state)
- update detection (hashing/versioning)
- delete handling
- robust hybrid search:
- RRF fuse
- candidate-generation rerank
- stronger filtering semantics (denormalized metadata columns)
- quotas, rate limits, per-source budgets
- full MCP tool contracts + tests
12. Summary
At runtime, ProxySQL RAG retrieval is implemented as:
- Index query (FTS/vector/hybrid) returning a small set of chunk IDs
- Chunk fetch returning the text that the LLM will ground on
- Optional authoritative refetch from the source DB by primary key
- Strict limits and consistent filtering to keep the service bounded