Add RAG capability blueprint documents

These documents serve as blueprints for implementing RAG (Retrieval-Augmented Generation) capabilities in ProxySQL:

- schema.sql: Database schema for RAG implementation
- rag_ingest.cpp: PoC ingester blueprint to be integrated into ProxySQL
- architecture-data-model.md: Data model architecture for RAG
- architecture-runtime-retrieval.md: Runtime retrieval architecture
- mcp-tools.md: MCP tools integration design
- sql-examples.md: SQL usage examples for RAG
- embeddings-design.md: Embeddings design for vector search

These files will guide the upcoming RAG implementation in ProxySQL.
Branch: pull/5318/head
Author: Rene Cannao, 3 months ago
Parent: 994bafa31f
Commit: 803115f504

# ProxySQL RAG Index — Data Model & Ingestion Architecture (v0 Blueprint)
This document explains the SQLite data model used to turn relational tables (e.g. MySQL `posts`) into a retrieval-friendly index hosted inside ProxySQL. It focuses on:
- What each SQLite table does
- How tables relate to each other
- How `rag_sources` defines **explicit mapping rules** (no guessing)
- How ingestion transforms rows into documents and chunks
- How FTS and vector indexes are maintained
- What evolves later for incremental sync and updates
---
## 1. Goal and core idea
Relational databases are excellent for structured queries, but RAG-style retrieval needs:
- Fast keyword search (error messages, identifiers, tags)
- Fast semantic search (similar meaning, paraphrased questions)
- A stable way to “refetch the authoritative data” from the source DB
The model below implements a **canonical document layer** inside ProxySQL:
1. Ingest selected rows from a source database (MySQL, PostgreSQL, etc.)
2. Convert each row into a **document** (title/body + metadata)
3. Split long bodies into **chunks**
4. Index chunks in:
- **FTS5** for keyword search
- **sqlite3-vec** for vector similarity
5. Serve retrieval through stable APIs (MCP or SQL), independent of where indexes physically live in the future
---
## 2. The SQLite tables (what they are and why they exist)
### 2.1 `rag_sources` — control plane: “what to ingest and how”
**Purpose**
- Defines each ingestion source (a table or view in an external DB)
- Stores *explicit* transformation rules:
- which columns become `title`, `body`
- which columns go into `metadata_json`
- how to build `doc_id`
- Stores chunking strategy and embedding strategy configuration
**Key columns**
- `backend_*`: how to connect (v0 connects directly; later may be “via ProxySQL”)
- `table_name`, `pk_column`: what to ingest
- `where_sql`: optional restriction (e.g. only questions)
- `doc_map_json`: mapping rules (required)
- `chunking_json`: chunking rules (required)
- `embedding_json`: embedding rules (optional)
**Important**: `rag_sources` is the **only place** that defines mapping logic.
A general-purpose ingester must never “guess” which fields belong to `body` or metadata.
---
### 2.2 `rag_documents` — canonical documents: “one per source row”
**Purpose**
- Represents the canonical document created from a single source row.
- Stores:
- a stable identifier (`doc_id`)
- a refetch pointer (`pk_json`)
- document text (`title`, `body`)
- structured metadata (`metadata_json`)
**Why store full `body` here?**
- Enables re-chunking later without re-fetching from the source DB.
- Makes debugging and inspection easier.
- Supports future update detection and diffing.
**Key columns**
- `doc_id` (PK): stable across runs and machines (e.g. `"posts:12345"`)
- `source_id`: ties back to `rag_sources`
- `pk_json`: how to refetch the authoritative row later (e.g. `{"Id":12345}`)
- `title`, `body`: canonical text
- `metadata_json`: non-text signals used for filters/boosting
- `updated_at`, `deleted`: lifecycle fields for incremental sync later
---
### 2.3 `rag_chunks` — retrieval units: “one or many per document”
**Purpose**
- Stores chunked versions of a document's text.
- Retrieval and embeddings are performed at the chunk level for better quality.
**Why chunk at all?**
- Long bodies reduce retrieval quality:
- FTS returns large documents where only a small part is relevant
- Vector embeddings of large texts smear multiple topics together
- Chunking yields:
- better precision
- better citations (“this chunk”) and smaller context
- cheaper updates (only re-embed changed chunks later)
**Key columns**
- `chunk_id` (PK): stable, derived from doc_id + chunk index (e.g. `"posts:12345#0"`)
- `doc_id` (FK): parent document
- `source_id`: convenience for filtering without joining documents
- `chunk_index`: 0..N-1
- `title`, `body`: chunk text (often title repeated for context)
- `metadata_json`: optional chunk-level metadata (offsets, “has_code”, section label)
- `updated_at`, `deleted`: lifecycle for later incremental sync
---
### 2.4 `rag_fts_chunks` — FTS5 index (contentless)
**Purpose**
- Keyword search index for chunks.
- Best for:
- exact terms
- identifiers
- error messages
- tags and code tokens (depending on tokenization)
**Design choice: contentless FTS**
- The FTS virtual table does not automatically mirror `rag_chunks`.
- The ingester explicitly inserts into FTS as chunks are created.
- This makes ingestion deterministic and avoids surprises when chunk bodies change later.
**Stored fields**
- `chunk_id` (unindexed, acts like a row identifier)
- `title`, `body` (indexed)
---
### 2.5 `rag_vec_chunks` — vector index (sqlite3-vec)
**Purpose**
- Semantic similarity search over chunks.
- Each chunk has a vector embedding.
**Key columns**
- `embedding float[DIM]`: embedding vector (DIM must match your model)
- `chunk_id`: join key to `rag_chunks`
- Optional metadata columns:
- `doc_id`, `source_id`, `updated_at`
- These allow filtering and joining without extra lookups, which helps performance.
**Note**
- The ingester decides what text is embedded (chunk body alone, or “Title + Tags + Body chunk”).
---
### 2.6 Optional convenience objects
- `rag_chunk_view`: joins `rag_chunks` with `rag_documents` for debugging/inspection
- `rag_sync_state`: reserved for incremental sync later (not used in v0)
---
## 3. Table relationships (the graph)
Think of this as a data pipeline graph:
```text
rag_sources          (defines mapping + chunking + embedding)
      |
      v
rag_documents        (1 row per source row)
      |
      v
rag_chunks           (1..N chunks per document)
     /    \
    v      v
rag_fts  rag_vec
```
**Cardinality**
- `rag_sources (1) -> rag_documents (N)`
- `rag_documents (1) -> rag_chunks (N)`
- `rag_chunks (1) -> rag_fts_chunks (1)` (insertion done by ingester)
- `rag_chunks (1) -> rag_vec_chunks (0/1+)` (0 if embeddings disabled; 1 typically)
---
## 4. How mapping is defined (no guessing)
### 4.1 Why `doc_map_json` exists
A general-purpose system cannot infer that:
- `posts.Body` should become document body
- `posts.Title` should become title
- `Score`, `Tags`, `CreationDate`, etc. should become metadata
- Or how to concatenate fields
Therefore, `doc_map_json` is required.
### 4.2 `doc_map_json` structure (v0)
`doc_map_json` defines:
- `doc_id.format`: string template with `{ColumnName}` placeholders
- `title.concat`: concatenation spec
- `body.concat`: concatenation spec
- `metadata.pick`: list of column names to include in metadata JSON
- `metadata.rename`: mapping of old key -> new key (useful for typos or schema differences)
**Concatenation parts**
- `{"col":"Column"}` — appends the column value (if present)
- `{"lit":"..."}` — appends a literal string
Example (posts-like):
```json
{
  "doc_id": { "format": "posts:{Id}" },
  "title": { "concat": [ { "col": "Title" } ] },
  "body": { "concat": [ { "col": "Body" } ] },
  "metadata": {
    "pick": ["Id","PostTypeId","Tags","Score","CreaionDate"],
    "rename": {"CreaionDate":"CreationDate"}
  }
}
```
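The mapping rules above can be applied mechanically. Below is a minimal Python sketch of such an interpreter (illustrative only; the actual ingester blueprint is `rag_ingest.cpp`, and the helper names here are hypothetical):

```python
import json

def apply_concat(parts, row):
    """Concatenate {"col": ...} and {"lit": ...} parts into one string."""
    out = []
    for p in parts:
        if "col" in p:
            v = row.get(p["col"])
            if v is not None:
                out.append(str(v))
        elif "lit" in p:
            out.append(p["lit"])
    return "".join(out)

def map_row(doc_map, row):
    """Build doc_id, title, body, metadata_json from explicit mapping rules."""
    doc_id = doc_map["doc_id"]["format"].format(**row)
    title = apply_concat(doc_map["title"]["concat"], row)
    body = apply_concat(doc_map["body"]["concat"], row)
    # metadata.pick selects columns; metadata.rename fixes keys afterwards
    meta = {k: row[k] for k in doc_map["metadata"]["pick"] if k in row}
    for old, new in doc_map["metadata"].get("rename", {}).items():
        if old in meta:
            meta[new] = meta.pop(old)
    return doc_id, title, body, json.dumps(meta)
```

Note that nothing here guesses: every output field comes from an explicit rule in `doc_map_json`.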
---
## 5. Chunking strategy definition
### 5.1 Why chunking is configured per source
Different tables need different chunking:
- StackOverflow `Body` may be long -> chunking recommended
- Small “reference” tables may not need chunking at all
Thus chunking is stored in `rag_sources.chunking_json`.
### 5.2 `chunking_json` structure (v0)
v0 supports **chars-based** chunking (simple, robust).
```json
{
  "enabled": true,
  "unit": "chars",
  "chunk_size": 4000,
  "overlap": 400,
  "min_chunk_size": 800
}
```
**Behavior**
- If `body.length <= chunk_size` -> one chunk
- Else chunks of `chunk_size` with `overlap`
- Avoid tiny final chunks by appending the tail to the previous chunk if below `min_chunk_size`
**Why overlap matters**
- Prevents splitting a key sentence or code snippet across boundaries
- Improves both FTS and semantic retrieval consistency
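The chars-based behavior above (one chunk for short bodies, overlapping windows otherwise, tiny tails absorbed into the previous chunk) can be sketched as follows. This is an illustrative Python sketch, not the ingester's actual implementation:

```python
def chunk_text(body, chunk_size=4000, overlap=400, min_chunk_size=800):
    """Chars-based chunking: fixed windows with overlap; a tail shorter
    than min_chunk_size is absorbed into the final window."""
    if len(body) <= chunk_size:
        return [body]
    step = chunk_size - overlap
    chunks = []
    start = 0
    while True:
        end = start + chunk_size
        remaining = len(body) - end
        if 0 < remaining < min_chunk_size:
            end = len(body)  # avoid emitting a tiny final chunk
        chunks.append(body[start:end])
        if end >= len(body):
            return chunks
        start += step
```

For example, a 10,000-char body with the defaults yields three chunks, each overlapping its neighbor by 400 chars.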
---
## 6. Embedding strategy definition (where it fits in the model)
### 6.1 Why embeddings are per chunk
- Better retrieval precision
- Smaller context per match
- Allows partial updates later (only re-embed changed chunks)
### 6.2 `embedding_json` structure (v0)
```json
{
  "enabled": true,
  "dim": 1536,
  "model": "text-embedding-3-large",
  "input": { "concat": [
    {"col":"Title"},
    {"lit":"\nTags: "}, {"col":"Tags"},
    {"lit":"\n\n"},
    {"chunk_body": true}
  ]}
}
```
**Meaning**
- Build embedding input text from:
- title
- tags (as plain text)
- chunk body
This improves semantic retrieval for question-like content without embedding numeric metadata.
---
## 7. Ingestion lifecycle (step-by-step)
For each enabled `rag_sources` entry:
1. **Connect** to source DB using `backend_*`
2. **Select rows** from `table_name` (and optional `where_sql`)
- Select only needed columns determined by `doc_map_json` and `embedding_json`
3. For each row:
- Build `doc_id` using `doc_map_json.doc_id.format`
- Build `pk_json` from `pk_column`
- Build `title` using `title.concat`
- Build `body` using `body.concat`
- Build `metadata_json` using `metadata.pick` and `metadata.rename`
4. **Skip** if `doc_id` already exists (v0 behavior)
5. Insert into `rag_documents`
6. Chunk `body` using `chunking_json`
7. For each chunk:
- Insert into `rag_chunks`
- Insert into `rag_fts_chunks`
- If embeddings enabled:
- Build embedding input text using `embedding_json.input`
- Compute embedding
- Insert into `rag_vec_chunks`
8. Commit (ideally in a transaction for performance)
---
## 8. What changes later (incremental sync and updates)
v0 is “insert-only and skip-existing.”
Product-grade ingestion requires:
### 8.1 Detecting changes
Options:
- Watermark by `LastActivityDate` / `updated_at` column
- Hash (e.g. `sha256(title||body||metadata)`) stored in documents table
- Compare chunk hashes to re-embed only changed chunks
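The hash option above might look like this sketch (illustrative; the separator byte guards against concatenation ambiguity between fields):

```python
import hashlib

def doc_hash(title, body, metadata_json):
    """Stable content hash for update detection: recompute on re-ingest
    and compare with the stored value before rewriting the document."""
    h = hashlib.sha256()
    for part in (title, body, metadata_json):
        h.update(part.encode("utf-8"))
        h.update(b"\x00")  # separator so ("ab","c") hashes differently from ("a","bc")
    return "sha256:" + h.hexdigest()
```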
### 8.2 Updating and deleting
Needs:
- Upsert documents
- Delete or mark `deleted=1` when source row deleted
- Rebuild chunks and indexes when body changes
- Maintain FTS rows:
- delete old chunk rows from FTS
- insert updated chunk rows
### 8.3 Checkpoints
Use `rag_sync_state` to store:
- last ingested timestamp
- GTID/LSN for CDC
- or a monotonic PK watermark
The current schema already includes:
- `updated_at` and `deleted`
- `rag_sync_state` placeholder
So incremental sync can be added without breaking the data model.
---
## 9. Practical example: mapping `posts` table
Given a MySQL `posts` row:
- `Id = 12345`
- `Title = "How to parse JSON in MySQL 8?"`
- `Body = "<p>I tried JSON_EXTRACT...</p>"`
- `Tags = "<mysql><json>"`
- `Score = 12`
With mapping:
- `doc_id = "posts:12345"`
- `title = Title`
- `body = Body`
- `metadata_json` includes `{ "Tags": "...", "Score": "12", ... }`
- chunking splits body into:
- `posts:12345#0`, `posts:12345#1`, etc.
- FTS is populated with the chunk text
- vectors are stored per chunk
---
## 10. Summary
This data model separates concerns cleanly:
- `rag_sources` defines *policy* (what/how to ingest)
- `rag_documents` defines canonical *identity and refetch pointer*
- `rag_chunks` defines retrieval *units*
- `rag_fts_chunks` defines keyword search
- `rag_vec_chunks` defines semantic search
This separation makes the system:
- general purpose (works for many schemas)
- deterministic (no magic inference)
- extensible to incremental sync, external indexes, and richer hybrid retrieval

# ProxySQL RAG Engine — Runtime Retrieval Architecture (v0 Blueprint)
This document describes how ProxySQL becomes a **RAG retrieval engine** at runtime. The companion document (Data Model & Ingestion) explains how content enters the SQLite index. This document explains how content is **queried**, how results are **returned to agents/applications**, and how **hybrid retrieval** works in practice.
It is written as an implementation blueprint for ProxySQL (and its MCP server) and assumes the SQLite schema contains:
- `rag_sources` (control plane)
- `rag_documents` (canonical docs)
- `rag_chunks` (retrieval units)
- `rag_fts_chunks` (FTS5)
- `rag_vec_chunks` (sqlite3-vec vectors)
---
## 1. The runtime role of ProxySQL in a RAG system
ProxySQL becomes a RAG runtime by providing four capabilities in one bounded service:
1. **Retrieval Index Host**
- Hosts the SQLite index and search primitives (FTS + vectors).
- Offers deterministic query semantics and strict budgets.
2. **Orchestration Layer**
- Implements search flows (FTS, vector, hybrid, rerank).
- Applies filters, caps, and result shaping.
3. **Stable API Surface (MCP-first)**
- LLM agents call MCP tools (not raw SQL).
- Tool contracts remain stable even if internal storage changes.
4. **Authoritative Row Refetch Gateway**
- After retrieval returns `doc_id` / `pk_json`, ProxySQL can refetch the authoritative row from the source DB on-demand (optional).
- This avoids returning stale or partial data when the full row is needed.
In production terms, this is not “ProxySQL as a general search engine.” It is a **bounded retrieval service** colocated with database access logic.
---
## 2. High-level query flow (agent-centric)
A typical RAG flow has two phases:
### Phase A — Retrieval (fast, bounded, cheap)
- Query the index to obtain a small number of relevant chunks (and their parent doc identity).
- Output includes `chunk_id`, `doc_id`, `score`, and small metadata.
### Phase B — Fetch (optional, authoritative, bounded)
- If the agent needs full context or structured fields, it refetches the authoritative row from the source DB using `pk_json`.
- This avoids scanning large tables and avoids shipping huge payloads in Phase A.
**Canonical flow**
1. `rag.search_hybrid(query, filters, k)` → returns top chunk ids and scores
2. `rag.get_chunks(chunk_ids)` → returns chunk text for prompt grounding/citations
3. Optional: `rag.fetch_from_source(doc_id)` → returns full row or selected columns
---
## 3. Runtime interfaces: MCP vs SQL
ProxySQL should support two “consumption modes”:
### 3.1 MCP tools (preferred for AI agents)
- Strict limits and predictable response schemas.
- Tools return structured results and avoid SQL injection concerns.
- Agents do not need direct DB access.
### 3.2 SQL access (for standard applications / debugging)
- Applications may connect to ProxySQL's SQLite admin interface (or a dedicated port) and issue SQL.
- Useful for:
- internal dashboards
- troubleshooting
- non-agent apps that want retrieval but speak SQL
**Principle**
- MCP is the stable, long-term interface.
- SQL is optional and may be restricted to trusted callers.
---
## 4. Retrieval primitives
### 4.1 FTS retrieval (keyword / exact match)
FTS5 is used for:
- error messages
- identifiers and function names
- tags and exact terms
- “grep-like” queries
**Typical output**
- `chunk_id`, `score_fts`, optional highlights/snippets
**Ranking**
- `bm25(rag_fts_chunks)` is the default. It is fast and effective for term queries.
### 4.2 Vector retrieval (semantic similarity)
Vector search is used for:
- paraphrased questions
- semantic similarity (“how to do X” vs “best way to achieve X”)
- conceptual matching that is poor with keyword-only search
**Typical output**
- `chunk_id`, `score_vec` (distance/similarity), plus join metadata
**Important**
- Vectors are generally computed per chunk.
- Filters are applied via `source_id` and joins to `rag_chunks` / `rag_documents`.
---
## 5. Hybrid retrieval patterns (two recommended modes)
Hybrid retrieval combines FTS and vector search for better quality than either alone. Two concrete modes should be implemented because they solve different problems.
### Mode 1 — “Best of both” (parallel FTS + vector; fuse results)
**Use when**
- the query may contain both exact tokens (e.g. error messages) and semantic intent
**Flow**
1. Run FTS top-N (e.g. N=50)
2. Run vector top-N (e.g. N=50)
3. Merge results by `chunk_id`
4. Score fusion (recommended): Reciprocal Rank Fusion (RRF)
5. Return top-k (e.g. k=10)
**Why RRF**
- Robust without score calibration
- Works across heterogeneous score ranges (bm25 vs cosine distance)
**RRF formula**
- For each candidate chunk:
- `score = w_fts/(k0 + rank_fts) + w_vec/(k0 + rank_vec)`
- Typical: `k0=60`, `w_fts=1.0`, `w_vec=1.0`
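Applied to two ranked candidate lists, the formula above becomes a short fusion routine (illustrative Python sketch; ranks are 1-based, and a chunk missing from one list simply contributes no term for it):

```python
def rrf_fuse(fts_ids, vec_ids, k0=60, w_fts=1.0, w_vec=1.0, k=10):
    """Reciprocal Rank Fusion over two ranked lists of chunk_ids.
    Returns the top-k (chunk_id, fused_score) pairs, higher is better."""
    scores = {}
    for rank, cid in enumerate(fts_ids, start=1):
        scores[cid] = scores.get(cid, 0.0) + w_fts / (k0 + rank)
    for rank, cid in enumerate(vec_ids, start=1):
        scores[cid] = scores.get(cid, 0.0) + w_vec / (k0 + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
```

A chunk ranked by both FTS and vector search accumulates two terms, so agreement between the retrievers is rewarded without any score calibration.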
### Mode 2 — “Broad FTS then vector refine” (candidate generation + rerank)
**Use when**
- you want strong precision anchored to exact term matches
- you want to avoid vector search over the entire corpus
**Flow**
1. Run broad FTS query top-M (e.g. M=200)
2. Fetch chunk texts for those candidates
3. Compute vector similarity of query embedding to candidate embeddings
4. Return top-k
This mode behaves like a two-stage retrieval pipeline:
- Stage 1: cheap recall (FTS)
- Stage 2: precise semantic rerank within candidates
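Stage 2 of this mode reduces to scoring FTS candidates against the query embedding. A minimal sketch (illustrative Python; in practice the embeddings would be fetched from `rag_vec_chunks` rather than passed in):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rerank_candidates(query_vec, candidates, k=10):
    """Stage 2 rerank: candidates is a list of (chunk_id, embedding) pairs
    produced by the broad FTS stage."""
    scored = [(cid, cosine(query_vec, emb)) for cid, emb in candidates]
    scored.sort(key=lambda kv: kv[1], reverse=True)
    return scored[:k]
```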
---
## 6. Filters, constraints, and budgets (blast-radius control)
A RAG retrieval engine must be bounded. ProxySQL should enforce limits at the MCP layer and ideally also at SQL helper functions.
### 6.1 Hard caps (recommended defaults)
- Maximum `k` returned: 50
- Maximum candidates for broad-stage: 200-500
- Maximum query length: e.g. 2-8 KB
- Maximum response bytes: e.g. 1-5 MB
- Maximum execution time per request: e.g. 50-250 ms for retrieval, 1-2 s for fetch
### 6.2 Filter semantics
Filters should be applied consistently across retrieval modes.
Common filters:
- `source_id` or `source_name`
- tag include/exclude (via metadata_json parsing or pre-extracted tag fields later)
- post type (question vs answer)
- minimum score
- time range (creation date / last activity)
Implementation note:
- v0 stores metadata in JSON; filtering can be implemented in the MCP layer or via SQLite JSON functions (if enabled).
- For performance, later versions should denormalize key metadata into dedicated columns or side tables.
---
## 7. Result shaping and what the caller receives
A retrieval response must be designed for downstream LLM usage:
### 7.1 Retrieval results (Phase A)
Return a compact list of “evidence candidates”:
- `chunk_id`
- `doc_id`
- `scores` (fts, vec, fused)
- short `title`
- minimal metadata (source, tags, timestamp, etc.)
Do **not** return full bodies by default; that is what `rag.get_chunks` is for.
### 7.2 Chunk fetch results (Phase A.2)
`rag.get_chunks(chunk_ids)` returns:
- `chunk_id`, `doc_id`
- `title`
- `body` (chunk text)
- optionally a snippet/highlight for display
### 7.3 Source refetch results (Phase B)
`rag.fetch_from_source(doc_id)` returns:
- either the full row
- or a selected subset of columns (recommended)
This is the “authoritative fetch” boundary that prevents stale/partial index usage from being a correctness problem.
---
## 8. SQL examples (runtime extraction)
These are not the preferred agent interface, but they are crucial for debugging and for SQL-native apps.
### 8.1 FTS search (top 10)
```sql
SELECT
  f.chunk_id,
  bm25(rag_fts_chunks) AS score_fts
FROM rag_fts_chunks f
WHERE rag_fts_chunks MATCH 'json_extract mysql'
ORDER BY score_fts
LIMIT 10;
```
Join to fetch text:
```sql
SELECT
  f.chunk_id,
  bm25(rag_fts_chunks) AS score_fts,
  c.doc_id,
  c.body
FROM rag_fts_chunks f
JOIN rag_chunks c ON c.chunk_id = f.chunk_id
WHERE rag_fts_chunks MATCH 'json_extract mysql'
ORDER BY score_fts
LIMIT 10;
```
### 8.2 Vector search (top 10)
Vector syntax depends on how you expose query vectors. A typical pattern is:
1) Bind a query vector into a function / parameter
2) Use `rag_vec_chunks` to return nearest neighbors
Example shape (conceptual):
```sql
-- Pseudocode: nearest neighbors for :query_embedding
SELECT
  v.chunk_id,
  v.distance
FROM rag_vec_chunks v
WHERE v.embedding MATCH :query_embedding
ORDER BY v.distance
LIMIT 10;
```
In production, ProxySQL MCP will typically compute the query embedding and call SQL internally with a bound parameter.
---
## 9. MCP tools (runtime API surface)
This document does not define full schemas (that is in `mcp-tools.md`), but it defines what each tool must do.
### 9.1 Retrieval
- `rag.search_fts(query, filters, k)`
- `rag.search_vector(query_text | query_embedding, filters, k)`
- `rag.search_hybrid(query, mode, filters, k, params)`
- Mode 1: parallel + RRF fuse
- Mode 2: broad FTS candidates + vector rerank
### 9.2 Fetch
- `rag.get_chunks(chunk_ids)`
- `rag.get_docs(doc_ids)`
- `rag.fetch_from_source(doc_ids | pk_json, columns?, limits?)`
**MCP-first principle**
- Agents do not see SQLite schema or SQL.
- MCP tools remain stable even if you move index storage out of ProxySQL later.
---
## 10. Operational considerations
### 10.1 Dedicated ProxySQL instance
Run GenAI retrieval in a dedicated ProxySQL instance to reduce blast radius:
- independent CPU/memory budgets
- independent configuration and rate limits
- independent failure domain
### 10.2 Observability and metrics (minimum)
- count of docs/chunks per source
- query counts by tool and source
- p50/p95 latency for:
- FTS
- vector
- hybrid
- refetch
- dropped/limited requests (rate limit hit, cap exceeded)
- error rate and error categories
### 10.3 Safety controls
- strict upper bounds on `k` and candidate sizes
- strict timeouts
- response size caps
- optional allowlists for sources accessible to agents
- tenant boundaries via filters (strongly recommended for multi-tenant)
---
## 11. Recommended “v0-to-v1” evolution checklist
### v0 (PoC)
- ingestion to docs/chunks
- FTS search
- vector search (if embedding pipeline available)
- simple hybrid search
- chunk fetch
- manual/limited source refetch
### v1 (product hardening)
- incremental sync checkpoints (`rag_sync_state`)
- update detection (hashing/versioning)
- delete handling
- robust hybrid search:
- RRF fuse
- candidate-generation rerank
- stronger filtering semantics (denormalized metadata columns)
- quotas, rate limits, per-source budgets
- full MCP tool contracts + tests
---
## 12. Summary
At runtime, ProxySQL RAG retrieval is implemented as:
- **Index query** (FTS/vector/hybrid) returning a small set of chunk IDs
- **Chunk fetch** returning the text that the LLM will ground on
- Optional **authoritative refetch** from the source DB by primary key
- Strict limits and consistent filtering to keep the service bounded

# ProxySQL RAG Index — Embeddings & Vector Retrieval Design (Chunk-Level) (v0→v1 Blueprint)
This document specifies how embeddings should be produced, stored, updated, and queried for chunk-level vector search in ProxySQL's RAG index. It is intended as an implementation blueprint.
It assumes:
- Chunking is already implemented (`rag_chunks`).
- ProxySQL includes **sqlite3-vec** and uses a `vec0(...)` virtual table (`rag_vec_chunks`).
- Retrieval is exposed primarily via MCP tools (`mcp-tools.md`).
---
## 1. Design objectives
1. **Chunk-level embeddings**
- Each chunk receives its own embedding for retrieval precision.
2. **Deterministic embedding input**
- The text embedded is explicitly defined per source, not inferred.
3. **Model agility**
- The system can change embedding models/dimensions without breaking stored data or APIs.
4. **Efficient updates**
- Only recompute embeddings for chunks whose embedding input changed.
5. **Operational safety**
- Bound cost and latency (embedding generation can be expensive).
- Allow asynchronous embedding jobs if needed later.
---
## 2. What to embed (and what not to embed)
### 2.1 Embed text that improves semantic retrieval
Recommended embedding input per chunk:
- Document title (if present)
- Tags (as plain text)
- Chunk body
Example embedding input template:
```
{Title}
Tags: {Tags}
{ChunkBody}
```
This typically improves semantic recall significantly for knowledge-base-like content (StackOverflow posts, docs, tickets, runbooks).
### 2.2 Do NOT embed numeric metadata by default
Do not embed fields like `Score`, `ViewCount`, `OwnerUserId`, timestamps, etc. These should remain structured and be used for:
- filtering
- boosting
- tie-breaking
- result shaping
Embedding numeric metadata into text typically adds noise and reduces semantic quality.
### 2.3 Code and HTML considerations
If your chunk body contains HTML or code:
- **v0**: embed raw text (works, but may be noisy)
- **v1**: normalize to improve quality:
- strip HTML tags (keep text content)
- preserve code blocks as text, but consider stripping excessive markup
- optionally create specialized “code-only” chunks for code-heavy sources
Normalization should be source-configurable.
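A minimal sketch of the v1 normalization steps named above (illustrative Python; the regex-based tag stripping is deliberately crude and keeps only text content, which matches the v1 intent but not a full HTML parser):

```python
import re

def normalize_for_embedding(text, strip_html=True, collapse_whitespace=True):
    """v1-style normalization before embedding."""
    if strip_html:
        text = re.sub(r"<[^>]+>", " ", text)   # drop tags, keep text content
    if collapse_whitespace:
        text = re.sub(r"\s+", " ", text).strip()
    return text
```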
---
## 3. Where embedding input rules are defined
Embedding input rules must be explicit and stored per source.
### 3.1 `rag_sources.embedding_json`
Recommended schema:
```json
{
  "enabled": true,
  "model": "text-embedding-3-large",
  "dim": 1536,
  "input": {
    "concat": [
      {"col":"Title"},
      {"lit":"\nTags: "}, {"col":"Tags"},
      {"lit":"\n\n"},
      {"chunk_body": true}
    ]
  },
  "normalize": {
    "strip_html": true,
    "collapse_whitespace": true
  }
}
```
**Semantics**
- `enabled`: whether to compute/store embeddings for this source
- `model`: logical name (for observability and compatibility checks)
- `dim`: vector dimension
- `input.concat`: how to build embedding input text
- `normalize`: optional normalization steps
---
## 4. Storage schema and model/versioning
### 4.1 Current v0 schema: single vector table
`rag_vec_chunks` stores:
- embedding vector
- chunk_id
- doc_id/source_id convenience columns
- updated_at
This is appropriate for v0 when you assume a single embedding model/dimension.
### 4.2 Recommended v1 evolution: support multiple models
In a product setting, you may want multiple embedding models (e.g. general vs code-centric).
Two ways to support this:
#### Option A: include model identity columns in `rag_vec_chunks`
Add columns:
- `model TEXT`
- `dim INTEGER` (optional if fixed per model)
Then allow multiple rows per `chunk_id` (unique key becomes `(chunk_id, model)`).
This may require schema change and a different vec0 design (some vec0 configurations support metadata columns, but uniqueness must be handled carefully).
#### Option B: one vec table per model (recommended if vec0 constraints exist)
Create:
- `rag_vec_chunks_1536_v1`
- `rag_vec_chunks_1024_code_v1`
etc.
Then MCP tools select the table based on requested model or default configuration.
**Recommendation**
Start with Option A only if your sqlite3-vec build makes it easy to filter by model. Otherwise, Option B is operationally cleaner.
---
## 5. Embedding generation pipeline
### 5.1 When embeddings are created
Embeddings are created during ingestion, immediately after chunk creation, if `embedding_json.enabled=true`.
This provides a simple, synchronous pipeline:
- ingest row → create chunks → compute embedding → store vector
### 5.2 When embeddings should be updated
Embeddings must be recomputed if the *embedding input string* changes. That depends on:
- title changes
- tags changes
- chunk body changes
- normalization rules changes (strip_html etc.)
- embedding model changes
Therefore, update logic should be based on a **content hash** of the embedding input.
---
## 6. Content hashing for efficient updates (v1 recommendation)
### 6.1 Why hashing is needed
Without hashing, you might recompute embeddings unnecessarily:
- expensive
- slow
- prevents incremental sync from being efficient
### 6.2 Recommended approach
Store `embedding_input_hash` per chunk per model.
Implementation options:
#### Option A: Store hash in `rag_chunks.metadata_json`
Example:
```json
{
  "chunk_index": 0,
  "embedding_hash": "sha256:...",
  "embedding_model": "text-embedding-3-large"
}
```
Pros: no schema changes.
Cons: JSON parsing overhead.
#### Option B: Dedicated side table (recommended)
Create `rag_chunk_embedding_state`:
```sql
CREATE TABLE rag_chunk_embedding_state (
  chunk_id   TEXT NOT NULL,
  model      TEXT NOT NULL,
  dim        INTEGER NOT NULL,
  input_hash TEXT NOT NULL,
  updated_at INTEGER NOT NULL DEFAULT (unixepoch()),
  PRIMARY KEY(chunk_id, model)
);
```
Pros: fast lookups; avoids JSON parsing.
Cons: extra table.
**Recommendation**
Use Option B for v1.
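The hash-gated flow against that side table might look like the following sketch (illustrative Python; the table here is simplified to omit the `updated_at` default, and function names are hypothetical):

```python
import hashlib
import sqlite3

def needs_embedding(conn, chunk_id, model, input_text):
    """Return the new input hash if the chunk must be (re)embedded, else None."""
    h = "sha256:" + hashlib.sha256(input_text.encode("utf-8")).hexdigest()
    row = conn.execute(
        "SELECT input_hash FROM rag_chunk_embedding_state "
        "WHERE chunk_id=? AND model=?", (chunk_id, model)).fetchone()
    return None if row and row[0] == h else h

def record_embedding(conn, chunk_id, model, dim, input_hash):
    """Upsert the embedding state after a successful embedding call."""
    conn.execute(
        "INSERT INTO rag_chunk_embedding_state(chunk_id, model, dim, input_hash) "
        "VALUES(?,?,?,?) "
        "ON CONFLICT(chunk_id, model) DO UPDATE SET input_hash=excluded.input_hash",
        (chunk_id, model, dim, input_hash))
```

The ingester calls `needs_embedding` before invoking the embedding provider; an unchanged hash means the expensive call is skipped entirely.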
---
## 7. Embedding model integration options
### 7.1 External embedding service (recommended initially)
ProxySQL calls an embedding service:
- OpenAI-compatible endpoint, or
- local service (e.g. llama.cpp server), or
- vendor-specific embedding API
Pros:
- easy to iterate on model choice
- isolates ML runtime from ProxySQL process
Cons:
- network latency; requires caching and timeouts
### 7.2 Embedded model runtime inside ProxySQL
ProxySQL links to an embedding runtime (llama.cpp, etc.).
Pros:
- no network dependency
- predictable latency if tuned
Cons:
- increases memory footprint
- needs careful resource controls
**Recommendation**
Start with an external embedding provider and keep a modular interface that can be swapped later.
---
## 8. Query embedding generation
Vector search needs a query embedding. Do this in the MCP layer:
1. Take `query_text`
2. Apply query normalization (optional but recommended)
3. Compute query embedding using the same model used for chunks
4. Execute vector search SQL with a bound embedding vector
**Do not**
- accept arbitrary embedding vectors from untrusted callers without validation
- allow unbounded query lengths
---
## 9. Vector search semantics
### 9.1 Distance vs similarity
Depending on the embedding model and vec search primitive, vector search may return:
- cosine distance (lower is better)
- cosine similarity (higher is better)
- L2 distance (lower is better)
**Recommendation**
Normalize to a “higher is better” score in MCP responses:
- if distance: `score_vec = 1 / (1 + distance)` or similar monotonic transform
Keep raw distance in debug fields if needed.
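The suggested transform is a one-liner; the only property that matters is that it is monotonically decreasing in distance, so relative ordering is preserved (illustrative sketch):

```python
def to_score(distance):
    """Map a 'lower is better' distance onto a 'higher is better' score in (0, 1]."""
    return 1.0 / (1.0 + distance)
```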
### 9.2 Filtering
Filtering should be supported by:
- `source_id` restriction
- optional metadata filters (doc-level or chunk-level)
In v0, filtering by `source_id` is easiest because `rag_vec_chunks` stores `source_id` as metadata.
---
## 10. Hybrid retrieval integration
Embeddings are one leg of hybrid retrieval. Two recommended hybrid modes are described in `mcp-tools.md`:
1. **Fuse**: top-N FTS and top-N vector, merged by chunk_id, fused by RRF
2. **FTS then vector**: broad FTS candidates then vector rerank within candidates
Embeddings support both:
- Fuse mode needs global vector search top-N.
- Candidate mode needs vector search restricted to candidate chunk IDs.
Candidate mode is often cheaper and more precise when the query includes strong exact tokens.
---
## 11. Operational controls
### 11.1 Resource limits
Embedding generation must be bounded by:
- max chunk size embedded
- max chunks embedded per document
- per-source embedding rate limit
- timeouts when calling embedding provider
### 11.2 Batch embedding
To improve throughput, embed in batches:
- collect N chunks
- send embedding request for N inputs
- store results
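Batching reduces to grouping pending chunks before each provider call, e.g. (illustrative sketch):

```python
def batches(items, n):
    """Yield lists of up to n items, for batched embedding requests."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == n:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch
```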
### 11.3 Backpressure and async embedding
For v1, consider decoupling embedding generation from ingestion:
- ingestion stores chunks
- embedding worker processes “pending” chunks and fills vectors
This allows:
- ingestion to remain fast
- embedding to scale independently
- retries on embedding failures
In this design, store a state record:
- pending / ok / error
- last error message
- retry count
---
## 12. Recommended implementation steps (coding agent checklist)
### v0 (synchronous embedding)
1. Implement `embedding_json` parsing in ingester
2. Build embedding input string for each chunk
3. Call embedding provider (or use a stub in development)
4. Insert vector rows into `rag_vec_chunks`
5. Implement `rag.search_vector` MCP tool using query embedding + vector SQL
### v1 (efficient incremental embedding)
1. Add `rag_chunk_embedding_state` table
2. Store `input_hash` per chunk per model
3. Only re-embed if hash changed
4. Add async embedding worker option
5. Add metrics for embedding throughput and failures
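The hash-based update detection in steps 2-3 can be sketched as follows, assuming SHA-256 over the model name plus the exact embedding input; the in-memory `state` dict stands in for the proposed `rag_chunk_embedding_state` table.

```python
import hashlib

def embedding_input_hash(model: str, text: str) -> str:
    """Hash the exact embedding input together with the model name, so a
    model change also triggers re-embedding."""
    h = hashlib.sha256()
    h.update(model.encode("utf-8"))
    h.update(b"\x00")  # separator so (model, text) pairs cannot collide
    h.update(text.encode("utf-8"))
    return h.hexdigest()

def needs_embedding(state, chunk_id, model, text):
    """state: {(chunk_id, model): stored_hash}, mirroring a
    rag_chunk_embedding_state lookup. Returns (should_embed, new_hash)."""
    new_hash = embedding_input_hash(model, text)
    return state.get((chunk_id, model)) != new_hash, new_hash
```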
---
## 13. Summary
- Compute embeddings per chunk, not per document.
- Define embedding input explicitly in `rag_sources.embedding_json`.
- Store vectors in `rag_vec_chunks` (vec0).
- For production, add hash-based update detection and optional async embedding workers.
- Normalize vector scores in MCP responses and keep raw distance for debugging.


# MCP Tooling for ProxySQL RAG Engine (v0 Blueprint)
This document defines the MCP tool surface for querying ProxySQL's embedded RAG index. It is intended as a stable interface for AI agents. Internally, these tools query the SQLite schema described in `schema.sql` and the retrieval logic described in `architecture-runtime-retrieval.md`.
**Design goals**
- Stable tool contracts (do not break agents when internals change)
- Strict bounds (prevent unbounded scans / large outputs)
- Deterministic schemas (agents can reliably parse outputs)
- Separation of concerns:
- Retrieval returns identifiers and scores
- Fetch returns content
- Optional refetch returns authoritative source rows
---
## 1. Conventions
### 1.1 Identifiers
- `doc_id`: stable document identifier (e.g. `posts:12345`)
- `chunk_id`: stable chunk identifier (e.g. `posts:12345#0`)
- `source_id` / `source_name`: corresponds to `rag_sources`
### 1.2 Scores
- FTS score: `score_fts` (bm25; lower is better in SQLite's bm25 by default)
- Vector score: `score_vec` (distance or similarity, depending on implementation)
- Hybrid score: `score` (normalized fused score; higher is better)
**Recommendation**
Normalize scores in MCP layer so:
- higher is always better for agent ranking
- raw internal ranking can still be returned as `score_fts_raw`, `distance_raw`, etc. if helpful
### 1.3 Limits and budgets (recommended defaults)
All tools should enforce caps, regardless of caller input:
- `k_max = 50`
- `candidates_max = 500`
- `query_max_bytes = 8192`
- `response_max_bytes = 5_000_000`
- `timeout_ms` (per tool): 250–2000 ms depending on tool type
Tools must return a `truncated` boolean if limits reduce output.
---
## 2. Shared filter model
Many tools accept the same filter structure. This is intentionally simple in v0.
### 2.1 Filter object
```json
{
"source_ids": [1,2],
"source_names": ["stack_posts"],
"doc_ids": ["posts:12345"],
"min_score": 5,
"post_type_ids": [1],
"tags_any": ["mysql","json"],
"tags_all": ["mysql","json"],
"created_after": "2022-01-01T00:00:00Z",
"created_before": "2025-01-01T00:00:00Z"
}
```
**Notes**
- In v0, most filters map to `metadata_json` values. Implementation can:
- filter in SQLite if JSON functions are available, or
- filter in MCP layer after initial retrieval (acceptable for small k/candidates)
- For production, denormalize hot filters into dedicated columns for speed.
### 2.2 Filter behavior
- If both `source_ids` and `source_names` are provided, treat as intersection.
- If no source filter is provided, default to all enabled sources **but** enforce a strict global budget.
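The two rules above can be sketched as a source-resolution helper; the function and field names are illustrative, with `sources` standing in for rows from `rag_sources`.

```python
def resolve_sources(filters, sources):
    """sources: list of dicts like rows from rag_sources, e.g.
    {"source_id": 1, "name": "stack_posts", "enabled": 1}.

    Applies the v0 rules: intersect source_ids and source_names when
    both are given; default to all enabled sources when neither is."""
    enabled = [s for s in sources if s["enabled"]]
    ids = set(filters.get("source_ids") or [])
    names = set(filters.get("source_names") or [])
    result = enabled
    if ids:
        result = [s for s in result if s["source_id"] in ids]
    if names:
        result = [s for s in result if s["name"] in names]
    return sorted(s["source_id"] for s in result)
```

Callers resolving an empty result set should fail fast rather than silently search nothing.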
---
## 3. Tool: `rag.search_fts`
Keyword search over `rag_fts_chunks`.
### 3.1 Request schema
```json
{
"query": "json_extract mysql",
"k": 10,
"offset": 0,
"filters": { },
"return": {
"include_title": true,
"include_metadata": true,
"include_snippets": false
}
}
```
### 3.2 Semantics
- Executes FTS query (MATCH) over indexed content.
- Returns top-k chunk matches with scores and identifiers.
- Does not return full chunk bodies unless `include_snippets` is requested (still bounded).
### 3.3 Response schema
```json
{
"results": [
{
"chunk_id": "posts:12345#0",
"doc_id": "posts:12345",
"source_id": 1,
"source_name": "stack_posts",
"score_fts": 0.73,
"title": "How to parse JSON in MySQL 8?",
"metadata": { "Tags": "<mysql><json>", "Score": "12" }
}
],
"truncated": false,
"stats": {
"k_requested": 10,
"k_returned": 10,
"ms": 12
}
}
```
---
## 4. Tool: `rag.search_vector`
Semantic search over `rag_vec_chunks`.
### 4.1 Request schema (text input)
```json
{
"query_text": "How do I extract JSON fields in MySQL?",
"k": 10,
"filters": { },
"embedding": {
"model": "text-embedding-3-large"
}
}
```
### 4.2 Request schema (precomputed vector)
```json
{
"query_embedding": {
"dim": 1536,
"values_b64": "AAAA..." // float32 array packed and base64 encoded
},
"k": 10,
"filters": { }
}
```
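The `values_b64` convention above (a packed float32 array, base64 encoded) can be implemented with the standard library; `pack_embedding`/`unpack_embedding` are hypothetical helper names, and little-endian byte order is an assumption that both sides must agree on.

```python
import base64
import struct

def pack_embedding(values):
    """Pack a list of floats as little-endian float32 and base64-encode
    it, matching the values_b64 convention described above."""
    raw = struct.pack(f"<{len(values)}f", *values)
    return base64.b64encode(raw).decode("ascii")

def unpack_embedding(values_b64, dim):
    """Inverse of pack_embedding; validates the declared dimension."""
    raw = base64.b64decode(values_b64)
    if len(raw) != dim * 4:
        raise ValueError(f"expected {dim * 4} bytes, got {len(raw)}")
    return list(struct.unpack(f"<{dim}f", raw))
```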
### 4.3 Semantics
- If `query_text` is provided, ProxySQL computes embedding internally (preferred for agents).
- If `query_embedding` is provided, ProxySQL uses it directly (useful for advanced clients).
- Returns nearest chunks by distance/similarity.
### 4.4 Response schema
```json
{
"results": [
{
"chunk_id": "posts:9876#1",
"doc_id": "posts:9876",
"source_id": 1,
"source_name": "stack_posts",
"score_vec": 0.82,
"title": "Query JSON columns efficiently",
"metadata": { "Tags": "<mysql><json>", "Score": "8" }
}
],
"truncated": false,
"stats": {
"k_requested": 10,
"k_returned": 10,
"ms": 18
}
}
```
---
## 5. Tool: `rag.search_hybrid`
Hybrid search combining FTS and vectors. Supports two modes:
- **Mode A**: parallel FTS + vector, fuse results (RRF recommended)
- **Mode B**: broad FTS candidate generation, then vector rerank
### 5.1 Request schema (Mode A: fuse)
```json
{
"query": "json_extract mysql",
"k": 10,
"filters": { },
"mode": "fuse",
"fuse": {
"fts_k": 50,
"vec_k": 50,
"rrf_k0": 60,
"w_fts": 1.0,
"w_vec": 1.0
}
}
```
### 5.2 Request schema (Mode B: candidates + rerank)
```json
{
"query": "json_extract mysql",
"k": 10,
"filters": { },
"mode": "fts_then_vec",
"fts_then_vec": {
"candidates_k": 200,
"rerank_k": 50,
"vec_metric": "cosine"
}
}
```
### 5.3 Semantics (Mode A)
1. Run FTS top `fts_k`
2. Run vector top `vec_k`
3. Merge candidates by `chunk_id`
4. Compute fused score (RRF recommended)
5. Return top `k`
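Steps 3-4 (merge by `chunk_id`, fuse by RRF) can be sketched as follows, using the `rrf_k0`, `w_fts`, and `w_vec` parameters from the request schema; this is an illustration of the scoring, not the production implementation.

```python
def rrf_fuse(fts_ids, vec_ids, k, rrf_k0=60, w_fts=1.0, w_vec=1.0):
    """fts_ids / vec_ids: chunk_ids already ranked best-first by each
    retrieval leg. Returns the top-k chunk_ids after weighted
    Reciprocal Rank Fusion: score = sum(w / (rrf_k0 + rank))."""
    scores = {}
    for rank, cid in enumerate(fts_ids, start=1):
        scores[cid] = scores.get(cid, 0.0) + w_fts / (rrf_k0 + rank)
    for rank, cid in enumerate(vec_ids, start=1):
        scores[cid] = scores.get(cid, 0.0) + w_vec / (rrf_k0 + rank)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [cid for cid, _ in ranked[:k]]
```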
### 5.4 Semantics (Mode B)
1. Run FTS top `candidates_k`
2. Compute vector similarity within those candidates
- either by joining candidate chunk_ids to stored vectors, or
- by embedding candidate chunk text on the fly (not recommended)
3. Return top `k` reranked results
4. Optionally return debug info about candidate stages
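Step 2 (vector rerank restricted to candidates) can be sketched with a plain cosine similarity over stored vectors; in production these vectors come from joining candidate chunk_ids to `rag_vec_chunks`, and the helper here is illustrative only.

```python
def rerank_candidates(query_vec, candidate_vecs, rerank_k):
    """candidate_vecs: {chunk_id: vector} for the FTS candidate set
    (stored vectors looked up by chunk_id). Returns the top rerank_k
    chunk_ids by cosine similarity to query_vec."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    scored = sorted(
        ((cosine(query_vec, v), cid) for cid, v in candidate_vecs.items()),
        reverse=True,
    )
    return [cid for _, cid in scored[:rerank_k]]
```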
### 5.5 Response schema
```json
{
"results": [
{
"chunk_id": "posts:12345#0",
"doc_id": "posts:12345",
"source_id": 1,
"source_name": "stack_posts",
"score": 0.91,
"score_fts": 0.74,
"score_vec": 0.86,
"title": "How to parse JSON in MySQL 8?",
"metadata": { "Tags": "<mysql><json>", "Score": "12" },
"debug": {
"rank_fts": 3,
"rank_vec": 6
}
}
],
"truncated": false,
"stats": {
"mode": "fuse",
"k_requested": 10,
"k_returned": 10,
"ms": 27
}
}
```
---
## 6. Tool: `rag.get_chunks`
Fetch chunk bodies by chunk_id. This is how agents obtain grounding text.
### 6.1 Request schema
```json
{
"chunk_ids": ["posts:12345#0", "posts:9876#1"],
"return": {
"include_title": true,
"include_doc_metadata": true,
"include_chunk_metadata": true
}
}
```
### 6.2 Response schema
```json
{
"chunks": [
{
"chunk_id": "posts:12345#0",
"doc_id": "posts:12345",
"title": "How to parse JSON in MySQL 8?",
"body": "<p>I tried JSON_EXTRACT...</p>",
"doc_metadata": { "Tags": "<mysql><json>", "Score": "12" },
"chunk_metadata": { "chunk_index": 0 }
}
],
"truncated": false,
"stats": { "ms": 6 }
}
```
**Hard limit recommendation**
- Cap total returned chunk bytes to a safe maximum (e.g. 1–2 MB).
---
## 7. Tool: `rag.get_docs`
Fetch full canonical documents by doc_id (not chunks). Useful for inspection or compact docs.
### 7.1 Request schema
```json
{
"doc_ids": ["posts:12345"],
"return": {
"include_body": true,
"include_metadata": true
}
}
```
### 7.2 Response schema
```json
{
"docs": [
{
"doc_id": "posts:12345",
"source_id": 1,
"source_name": "stack_posts",
"pk_json": { "Id": 12345 },
"title": "How to parse JSON in MySQL 8?",
"body": "<p>...</p>",
"metadata": { "Tags": "<mysql><json>", "Score": "12" }
}
],
"truncated": false,
"stats": { "ms": 7 }
}
```
---
## 8. Tool: `rag.fetch_from_source`
Refetch authoritative rows from the source DB using `doc_id` (via pk_json).
### 8.1 Request schema
```json
{
"doc_ids": ["posts:12345"],
"columns": ["Id","Title","Body","Tags","Score"],
"limits": {
"max_rows": 10,
"max_bytes": 200000
}
}
```
### 8.2 Semantics
- Look up doc(s) in `rag_documents` to get `source_id` and `pk_json`
- Resolve source connection from `rag_sources`
- Execute a parameterized query by primary key
- Return requested columns only
- Enforce strict limits
### 8.3 Response schema
```json
{
"rows": [
{
"doc_id": "posts:12345",
"source_name": "stack_posts",
"row": {
"Id": 12345,
"Title": "How to parse JSON in MySQL 8?",
"Score": 12
}
}
],
"truncated": false,
"stats": { "ms": 22 }
}
```
**Security note**
- This tool must not allow arbitrary SQL.
- Only allow fetching by primary key and a whitelist of columns.
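A minimal sketch of that restriction: build only parameterized fetch-by-PK statements against a column whitelist. The helper name is hypothetical; in the real implementation `table` and `pk_column` would themselves come from trusted `rag_sources` config, never from the caller.

```python
def build_fetch_query(table, pk_column, requested_columns, allowed_columns):
    """Build a parameterized fetch-by-primary-key query, rejecting any
    column not in the whitelist. The PK value is always bound as a
    parameter, never interpolated into the SQL text."""
    bad = [c for c in requested_columns if c not in allowed_columns]
    if bad:
        raise ValueError(f"columns not allowed: {bad}")
    cols = ", ".join(f"`{c}`" for c in requested_columns)
    return f"SELECT {cols} FROM `{table}` WHERE `{pk_column}` = ?"
```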
---
## 9. Tool: `rag.admin.stats` (recommended)
Operational visibility for dashboards and debugging.
### 9.1 Request
```json
{}
```
### 9.2 Response
```json
{
"sources": [
{
"source_id": 1,
"source_name": "stack_posts",
"docs": 123456,
"chunks": 456789,
"last_sync": null
}
],
"stats": { "ms": 5 }
}
```
---
## 10. Tool: `rag.admin.sync` (optional in v0; required in v1)
Kicks off ingestion for a source or for all sources. In v0, ingestion may run as a separate process; in ProxySQL product form, this would trigger an internal job.
### 10.1 Request
```json
{
"source_names": ["stack_posts"]
}
```
### 10.2 Response
```json
{
"accepted": true,
"job_id": "sync-2026-01-19T10:00:00Z"
}
```
---
## 11. Implementation notes (what the coding agent should implement)
1. **Input validation and caps** for every tool.
2. **Consistent filtering** across FTS/vector/hybrid.
3. **Stable scoring semantics** (higher-is-better recommended).
4. **Efficient joins**:
- vector search returns chunk_ids; join to `rag_chunks`/`rag_documents` for metadata.
5. **Hybrid modes**:
- Mode A (fuse): implement RRF
- Mode B (fts_then_vec): candidate set then vector rerank
6. **Error model**:
- return structured errors with codes (e.g. `INVALID_ARGUMENT`, `LIMIT_EXCEEDED`, `INTERNAL`)
7. **Observability**:
- return `stats.ms` in responses
- track tool usage counters and latency histograms
---
## 12. Summary
These MCP tools define a stable retrieval interface:
- Search: `rag.search_fts`, `rag.search_vector`, `rag.search_hybrid`
- Fetch: `rag.get_chunks`, `rag.get_docs`, `rag.fetch_from_source`
- Admin: `rag.admin.stats`, optionally `rag.admin.sync`

-- ============================================================
-- ProxySQL RAG Index Schema (SQLite)
-- v0: documents + chunks + FTS5 + sqlite3-vec embeddings
-- ============================================================
PRAGMA foreign_keys = ON;
PRAGMA journal_mode = WAL;
PRAGMA synchronous = NORMAL;
-- ============================================================
-- 1) rag_sources: control plane
-- Defines where to fetch from + how to transform + chunking.
-- ============================================================
CREATE TABLE IF NOT EXISTS rag_sources (
source_id INTEGER PRIMARY KEY,
name TEXT NOT NULL UNIQUE, -- e.g. "stack_posts"
enabled INTEGER NOT NULL DEFAULT 1,
-- Where to retrieve from (PoC: connect directly; later can be "via ProxySQL")
backend_type TEXT NOT NULL, -- "mysql" | "postgres" | ...
backend_host TEXT NOT NULL,
backend_port INTEGER NOT NULL,
backend_user TEXT NOT NULL,
backend_pass TEXT NOT NULL,
backend_db TEXT NOT NULL, -- database/schema name
table_name TEXT NOT NULL, -- e.g. "posts"
pk_column TEXT NOT NULL, -- e.g. "Id"
-- Optional: restrict ingestion; appended to SELECT as WHERE <where_sql>
where_sql TEXT, -- e.g. "PostTypeId IN (1,2)"
-- REQUIRED: mapping from source row -> rag_documents fields
-- JSON spec describing doc_id, title/body concat, metadata pick/rename, etc.
doc_map_json TEXT NOT NULL,
-- REQUIRED: chunking strategy (enabled, chunk_size, overlap, etc.)
chunking_json TEXT NOT NULL,
-- Optional: embedding strategy (how to build embedding input text)
-- In v0 you can keep it NULL/empty; define later without schema changes.
embedding_json TEXT,
created_at INTEGER NOT NULL DEFAULT (unixepoch()),
updated_at INTEGER NOT NULL DEFAULT (unixepoch())
);
CREATE INDEX IF NOT EXISTS idx_rag_sources_enabled
ON rag_sources(enabled);
CREATE INDEX IF NOT EXISTS idx_rag_sources_backend
ON rag_sources(backend_type, backend_host, backend_port, backend_db, table_name);
-- ============================================================
-- 2) rag_documents: canonical documents
-- One document per source row (e.g. one per posts.Id).
-- ============================================================
CREATE TABLE IF NOT EXISTS rag_documents (
doc_id TEXT PRIMARY KEY, -- stable: e.g. "posts:12345"
source_id INTEGER NOT NULL REFERENCES rag_sources(source_id),
source_name TEXT NOT NULL, -- copy of rag_sources.name for convenience
pk_json TEXT NOT NULL, -- e.g. {"Id":12345}
title TEXT,
body TEXT,
metadata_json TEXT NOT NULL DEFAULT '{}', -- JSON object
updated_at INTEGER NOT NULL DEFAULT (unixepoch()),
deleted INTEGER NOT NULL DEFAULT 0
);
CREATE INDEX IF NOT EXISTS idx_rag_documents_source_updated
ON rag_documents(source_id, updated_at);
CREATE INDEX IF NOT EXISTS idx_rag_documents_source_deleted
ON rag_documents(source_id, deleted);
-- ============================================================
-- 3) rag_chunks: chunked content
-- The unit we index in FTS and vectors.
-- ============================================================
CREATE TABLE IF NOT EXISTS rag_chunks (
chunk_id TEXT PRIMARY KEY, -- e.g. "posts:12345#0"
doc_id TEXT NOT NULL REFERENCES rag_documents(doc_id),
source_id INTEGER NOT NULL REFERENCES rag_sources(source_id),
chunk_index INTEGER NOT NULL, -- 0..N-1
title TEXT,
body TEXT NOT NULL,
-- Optional per-chunk metadata (e.g. offsets, has_code, section label)
metadata_json TEXT NOT NULL DEFAULT '{}',
updated_at INTEGER NOT NULL DEFAULT (unixepoch()),
deleted INTEGER NOT NULL DEFAULT 0
);
CREATE UNIQUE INDEX IF NOT EXISTS uq_rag_chunks_doc_idx
ON rag_chunks(doc_id, chunk_index);
CREATE INDEX IF NOT EXISTS idx_rag_chunks_source_doc
ON rag_chunks(source_id, doc_id);
CREATE INDEX IF NOT EXISTS idx_rag_chunks_deleted
ON rag_chunks(deleted);
-- ============================================================
-- 4) rag_fts_chunks: FTS5 index (contentless)
-- Maintained explicitly by the ingester.
-- Notes:
-- - chunk_id is stored but UNINDEXED.
-- - Use bm25(rag_fts_chunks) for ranking.
-- ============================================================
CREATE VIRTUAL TABLE IF NOT EXISTS rag_fts_chunks
USING fts5(
chunk_id UNINDEXED,
title,
body,
tokenize = 'unicode61'
);
-- ============================================================
-- 5) rag_vec_chunks: sqlite3-vec index
-- Stores embeddings per chunk for vector search.
--
-- IMPORTANT:
-- - dimension must match your embedding model (example: 1536).
-- - metadata columns are included to help join/filter.
-- ============================================================
CREATE VIRTUAL TABLE IF NOT EXISTS rag_vec_chunks
USING vec0(
embedding float[1536], -- change if you use another dimension
chunk_id TEXT, -- join key back to rag_chunks
doc_id TEXT, -- optional convenience
source_id INTEGER, -- optional convenience
updated_at INTEGER -- optional convenience
);
-- Optional: convenience view for debugging / SQL access patterns
CREATE VIEW IF NOT EXISTS rag_chunk_view AS
SELECT
c.chunk_id,
c.doc_id,
c.source_id,
d.source_name,
d.pk_json,
COALESCE(c.title, d.title) AS title,
c.body,
d.metadata_json AS doc_metadata_json,
c.metadata_json AS chunk_metadata_json,
c.updated_at
FROM rag_chunks c
JOIN rag_documents d ON d.doc_id = c.doc_id
WHERE c.deleted = 0 AND d.deleted = 0;
-- ============================================================
-- 6) (Optional) sync state placeholder for later incremental ingestion
-- Not used in v0, but reserving it avoids later schema churn.
-- ============================================================
CREATE TABLE IF NOT EXISTS rag_sync_state (
source_id INTEGER PRIMARY KEY REFERENCES rag_sources(source_id),
mode TEXT NOT NULL DEFAULT 'poll', -- 'poll' | 'cdc'
cursor_json TEXT NOT NULL DEFAULT '{}', -- watermark/checkpoint
last_ok_at INTEGER,
last_error TEXT
);

# ProxySQL RAG Index — SQL Examples (FTS, Vectors, Hybrid)
This file provides concrete SQL examples for querying the ProxySQL-hosted SQLite RAG index directly (for debugging, internal dashboards, or SQL-native applications).
The **preferred interface for AI agents** remains MCP tools (`mcp-tools.md`). SQL access should typically be restricted to trusted callers.
Assumed tables:
- `rag_documents`
- `rag_chunks`
- `rag_fts_chunks` (FTS5)
- `rag_vec_chunks` (sqlite3-vec vec0 table)
---
## 0. Common joins and inspection
### 0.1 Inspect one document and its chunks
```sql
SELECT * FROM rag_documents WHERE doc_id = 'posts:12345';
SELECT * FROM rag_chunks WHERE doc_id = 'posts:12345' ORDER BY chunk_index;
```
### 0.2 Use the convenience view (if enabled)
```sql
SELECT * FROM rag_chunk_view WHERE doc_id = 'posts:12345' ORDER BY chunk_id;
```
---
## 1. FTS5 examples
### 1.1 Basic FTS search (top 10)
```sql
SELECT
f.chunk_id,
bm25(rag_fts_chunks) AS score_fts_raw
FROM rag_fts_chunks f
WHERE rag_fts_chunks MATCH 'json_extract mysql'
ORDER BY score_fts_raw
LIMIT 10;
```
### 1.2 Join FTS results to chunk text and document metadata
```sql
SELECT
f.chunk_id,
bm25(rag_fts_chunks) AS score_fts_raw,
c.doc_id,
COALESCE(c.title, d.title) AS title,
c.body AS chunk_body,
d.metadata_json AS doc_metadata_json
FROM rag_fts_chunks f
JOIN rag_chunks c ON c.chunk_id = f.chunk_id
JOIN rag_documents d ON d.doc_id = c.doc_id
WHERE rag_fts_chunks MATCH 'json_extract mysql'
AND c.deleted = 0 AND d.deleted = 0
ORDER BY score_fts_raw
LIMIT 10;
```
### 1.3 Apply a source filter (by source_id)
```sql
SELECT
f.chunk_id,
bm25(rag_fts_chunks) AS score_fts_raw
FROM rag_fts_chunks f
JOIN rag_chunks c ON c.chunk_id = f.chunk_id
WHERE rag_fts_chunks MATCH 'replication lag'
AND c.source_id = 1
ORDER BY score_fts_raw
LIMIT 20;
```
### 1.4 Phrase queries, boolean operators (FTS5)
```sql
-- phrase
SELECT chunk_id FROM rag_fts_chunks
WHERE rag_fts_chunks MATCH '"group replication"'
LIMIT 20;
-- boolean: term1 AND term2
SELECT chunk_id FROM rag_fts_chunks
WHERE rag_fts_chunks MATCH 'mysql AND deadlock'
LIMIT 20;
-- boolean: term1 NOT term2
SELECT chunk_id FROM rag_fts_chunks
WHERE rag_fts_chunks MATCH 'mysql NOT mariadb'
LIMIT 20;
```
---
## 2. Vector search examples (sqlite3-vec)
Vector SQL varies slightly depending on sqlite3-vec build and how you bind vectors.
Below are **two patterns** you can implement in ProxySQL.
### 2.1 Pattern A (recommended): ProxySQL computes embeddings; SQL receives a bound vector
In this pattern, ProxySQL:
1) Computes the query embedding in C++
2) Executes SQL with a bound parameter `:qvec` representing the embedding
A typical “nearest neighbors” query shape is:
```sql
-- PSEUDOCODE: adapt to sqlite3-vec's exact operator/function in your build.
SELECT
v.chunk_id,
v.distance AS distance_raw
FROM rag_vec_chunks v
WHERE v.embedding MATCH :qvec
ORDER BY distance_raw
LIMIT 10;
```
Then join to chunks:
```sql
-- PSEUDOCODE: join with content and metadata
SELECT
v.chunk_id,
v.distance AS distance_raw,
c.doc_id,
c.body AS chunk_body,
d.metadata_json AS doc_metadata_json
FROM (
SELECT chunk_id, distance
FROM rag_vec_chunks
WHERE embedding MATCH :qvec
ORDER BY distance
LIMIT 10
) v
JOIN rag_chunks c ON c.chunk_id = v.chunk_id
JOIN rag_documents d ON d.doc_id = c.doc_id;
```
### 2.2 Pattern B (debug): store a query vector in a temporary table
This is useful when you want to run vector queries manually in SQL without MCP support.
```sql
CREATE TEMP TABLE tmp_query_vec(qvec BLOB);
-- Insert the query vector (float32 array blob). The insertion is usually done by tooling, not manually.
-- INSERT INTO tmp_query_vec VALUES (X'...');
-- PSEUDOCODE: use tmp_query_vec.qvec as the query embedding
SELECT
v.chunk_id,
v.distance
FROM rag_vec_chunks v, tmp_query_vec t
WHERE v.embedding MATCH t.qvec
ORDER BY v.distance
LIMIT 10;
```
---
## 3. Hybrid search examples
Hybrid retrieval is best implemented in the MCP layer because it mixes ranking systems and needs careful bounding.
However, you can approximate hybrid behavior using SQL to validate logic.
### 3.1 Hybrid Mode A: Parallel FTS + Vector then fuse (RRF)
#### Step 1: FTS top 50 (ranked)
```sql
WITH fts AS (
SELECT
f.chunk_id,
bm25(rag_fts_chunks) AS score_fts_raw
FROM rag_fts_chunks f
WHERE rag_fts_chunks MATCH :fts_query
ORDER BY score_fts_raw
LIMIT 50
)
SELECT * FROM fts;
```
#### Step 2: Vector top 50 (ranked)
```sql
WITH vec AS (
SELECT
v.chunk_id,
v.distance AS distance_raw
FROM rag_vec_chunks v
WHERE v.embedding MATCH :qvec
ORDER BY v.distance
LIMIT 50
)
SELECT * FROM vec;
```
#### Step 3: Fuse via Reciprocal Rank Fusion (RRF)
In SQL you need ranks. SQLite supports window functions in modern builds.
```sql
WITH
fts AS (
SELECT
f.chunk_id,
bm25(rag_fts_chunks) AS score_fts_raw,
ROW_NUMBER() OVER (ORDER BY bm25(rag_fts_chunks)) AS rank_fts
FROM rag_fts_chunks f
WHERE rag_fts_chunks MATCH :fts_query
ORDER BY score_fts_raw
LIMIT 50
),
vec AS (
SELECT
v.chunk_id,
v.distance AS distance_raw,
ROW_NUMBER() OVER (ORDER BY v.distance) AS rank_vec
FROM rag_vec_chunks v
WHERE v.embedding MATCH :qvec
ORDER BY v.distance
LIMIT 50
),
merged AS (
SELECT
COALESCE(fts.chunk_id, vec.chunk_id) AS chunk_id,
fts.rank_fts,
vec.rank_vec,
fts.score_fts_raw,
vec.distance_raw
FROM fts
FULL OUTER JOIN vec ON vec.chunk_id = fts.chunk_id
),
rrf AS (
SELECT
chunk_id,
score_fts_raw,
distance_raw,
rank_fts,
rank_vec,
(1.0 / (60.0 + COALESCE(rank_fts, 1000000))) +
(1.0 / (60.0 + COALESCE(rank_vec, 1000000))) AS score_rrf
FROM merged
)
SELECT
r.chunk_id,
r.score_rrf,
c.doc_id,
c.body AS chunk_body
FROM rrf r
JOIN rag_chunks c ON c.chunk_id = r.chunk_id
ORDER BY r.score_rrf DESC
LIMIT 10;
```
**Important**: SQLite supports `FULL OUTER JOIN` only since version 3.39.0, so this query will not run on older builds.
For production, implement the merge/fuse in C++ (MCP layer). This SQL is illustrative.
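The application-side merge that replaces the `FULL OUTER JOIN` is straightforward; the production version would live in the C++ MCP layer, and this Python sketch only illustrates the shape of it.

```python
def full_outer_merge(fts_rows, vec_rows):
    """Emulate the FULL OUTER JOIN of the two ranked CTEs in
    application code.

    fts_rows: [(chunk_id, rank_fts)]; vec_rows: [(chunk_id, rank_vec)].
    Returns {chunk_id: (rank_fts or None, rank_vec or None)}, ready for
    RRF scoring with COALESCE-style defaults for missing ranks."""
    merged = {cid: (rank, None) for cid, rank in fts_rows}
    for cid, rank in vec_rows:
        fts_rank = merged[cid][0] if cid in merged else None
        merged[cid] = (fts_rank, rank)
    return merged
```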
### 3.2 Hybrid Mode B: Broad FTS then vector rerank (candidate generation)
#### Step 1: FTS candidate set (top 200)
```sql
WITH candidates AS (
SELECT
f.chunk_id,
bm25(rag_fts_chunks) AS score_fts_raw
FROM rag_fts_chunks f
WHERE rag_fts_chunks MATCH :fts_query
ORDER BY score_fts_raw
LIMIT 200
)
SELECT * FROM candidates;
```
#### Step 2: Vector rerank within candidates
Conceptually:
- Join candidates to `rag_vec_chunks` and compute distance to `:qvec`
- Keep top 10
```sql
WITH candidates AS (
SELECT
f.chunk_id
FROM rag_fts_chunks f
WHERE rag_fts_chunks MATCH :fts_query
ORDER BY bm25(rag_fts_chunks)
LIMIT 200
),
reranked AS (
SELECT
v.chunk_id,
v.distance AS distance_raw
FROM rag_vec_chunks v
JOIN candidates c ON c.chunk_id = v.chunk_id
WHERE v.embedding MATCH :qvec
ORDER BY v.distance
LIMIT 10
)
SELECT
r.chunk_id,
r.distance_raw,
ch.doc_id,
ch.body
FROM reranked r
JOIN rag_chunks ch ON ch.chunk_id = r.chunk_id;
```
As above, the exact `MATCH :qvec` syntax may need adaptation to your sqlite3-vec build; implement vector query execution in C++ and keep SQL as internal glue.
---
## 4. Common “application-friendly” queries
### 4.1 Return doc_id + score + title only (no bodies)
```sql
SELECT
f.chunk_id,
c.doc_id,
COALESCE(c.title, d.title) AS title,
bm25(rag_fts_chunks) AS score_fts_raw
FROM rag_fts_chunks f
JOIN rag_chunks c ON c.chunk_id = f.chunk_id
JOIN rag_documents d ON d.doc_id = c.doc_id
WHERE rag_fts_chunks MATCH :q
ORDER BY score_fts_raw
LIMIT 20;
```
### 4.2 Return top doc_ids (deduplicate by doc_id)
```sql
WITH ranked_chunks AS (
SELECT
c.doc_id,
bm25(rag_fts_chunks) AS score_fts_raw
FROM rag_fts_chunks f
JOIN rag_chunks c ON c.chunk_id = f.chunk_id
WHERE rag_fts_chunks MATCH :q
ORDER BY score_fts_raw
LIMIT 200
)
SELECT doc_id, MIN(score_fts_raw) AS best_score
FROM ranked_chunks
GROUP BY doc_id
ORDER BY best_score
LIMIT 20;
```
---
## 5. Practical guidance
- Use SQL mode mainly for debugging and internal tooling.
- Prefer MCP tools for agent interaction:
- stable schemas
- strong guardrails
- consistent hybrid scoring
- Implement hybrid fusion in C++ (not in SQL) to avoid dialect limitations and to keep scoring correct.