Add RAG capability blueprint documents

These documents serve as blueprints for implementing RAG (Retrieval-Augmented Generation) capabilities in ProxySQL:

- schema.sql: Database schema for RAG implementation
- rag_ingest.cpp: PoC ingester blueprint to be integrated into ProxySQL
- architecture-data-model.md: Data model architecture for RAG
- architecture-runtime-retrieval.md: Runtime retrieval architecture
- mcp-tools.md: MCP tools integration design
- sql-examples.md: SQL usage examples for RAG
- embeddings-design.md: Embeddings design for vector search

These files will guide the upcoming RAG implementation in ProxySQL.
Branch: pull/5318/head
Author: Rene Cannao, 3 months ago
Parent: 994bafa31f
Commit: 803115f504

# ProxySQL RAG Index — Data Model & Ingestion Architecture (v0 Blueprint)
This document explains the SQLite data model used to turn relational tables (e.g. MySQL `posts`) into a retrieval-friendly index hosted inside ProxySQL. It focuses on:
- What each SQLite table does
- How tables relate to each other
- How `rag_sources` defines **explicit mapping rules** (no guessing)
- How ingestion transforms rows into documents and chunks
- How FTS and vector indexes are maintained
- What evolves later for incremental sync and updates
---
## 1. Goal and core idea
Relational databases are excellent for structured queries, but RAG-style retrieval needs:
- Fast keyword search (error messages, identifiers, tags)
- Fast semantic search (similar meaning, paraphrased questions)
- A stable way to “refetch the authoritative data” from the source DB
The model below implements a **canonical document layer** inside ProxySQL:
1. Ingest selected rows from a source database (MySQL, PostgreSQL, etc.)
2. Convert each row into a **document** (title/body + metadata)
3. Split long bodies into **chunks**
4. Index chunks in:
- **FTS5** for keyword search
- **sqlite3-vec** for vector similarity
5. Serve retrieval through stable APIs (MCP or SQL), independent of where indexes physically live in the future
---
## 2. The SQLite tables (what they are and why they exist)
### 2.1 `rag_sources` — control plane: “what to ingest and how”
**Purpose**
- Defines each ingestion source (a table or view in an external DB)
- Stores *explicit* transformation rules:
- which columns become `title`, `body`
- which columns go into `metadata_json`
- how to build `doc_id`
- Stores chunking strategy and embedding strategy configuration
**Key columns**
- `backend_*`: how to connect (v0 connects directly; later may be “via ProxySQL”)
- `table_name`, `pk_column`: what to ingest
- `where_sql`: optional restriction (e.g. only questions)
- `doc_map_json`: mapping rules (required)
- `chunking_json`: chunking rules (required)
- `embedding_json`: embedding rules (optional)
**Important**: `rag_sources` is the **only place** that defines mapping logic.
A general-purpose ingester must never “guess” which fields belong to `body` or metadata.
---
### 2.2 `rag_documents` — canonical documents: “one per source row”
**Purpose**
- Represents the canonical document created from a single source row.
- Stores:
- a stable identifier (`doc_id`)
- a refetch pointer (`pk_json`)
- document text (`title`, `body`)
- structured metadata (`metadata_json`)
**Why store full `body` here?**
- Enables re-chunking later without re-fetching from the source DB.
- Makes debugging and inspection easier.
- Supports future update detection and diffing.
**Key columns**
- `doc_id` (PK): stable across runs and machines (e.g. `"posts:12345"`)
- `source_id`: ties back to `rag_sources`
- `pk_json`: how to refetch the authoritative row later (e.g. `{"Id":12345}`)
- `title`, `body`: canonical text
- `metadata_json`: non-text signals used for filters/boosting
- `updated_at`, `deleted`: lifecycle fields for incremental sync later
---
### 2.3 `rag_chunks` — retrieval units: “one or many per document”
**Purpose**
- Stores chunked versions of a document's text.
- Retrieval and embeddings are performed at the chunk level for better quality.
**Why chunk at all?**
- Long bodies reduce retrieval quality:
- FTS returns large documents where only a small part is relevant
- Vector embeddings of large texts smear multiple topics together
- Chunking yields:
- better precision
- better citations (“this chunk”) and smaller context
- cheaper updates (only re-embed changed chunks later)
**Key columns**
- `chunk_id` (PK): stable, derived from doc_id + chunk index (e.g. `"posts:12345#0"`)
- `doc_id` (FK): parent document
- `source_id`: convenience for filtering without joining documents
- `chunk_index`: 0..N-1
- `title`, `body`: chunk text (often title repeated for context)
- `metadata_json`: optional chunk-level metadata (offsets, “has_code”, section label)
- `updated_at`, `deleted`: lifecycle for later incremental sync
---
### 2.4 `rag_fts_chunks` — FTS5 index (contentless)
**Purpose**
- Keyword search index for chunks.
- Best for:
- exact terms
- identifiers
- error messages
- tags and code tokens (depending on tokenization)
**Design choice: contentless FTS**
- The FTS virtual table does not automatically mirror `rag_chunks`.
- The ingester explicitly inserts into FTS as chunks are created.
- This makes ingestion deterministic and avoids surprises when chunk bodies change later.
**Stored fields**
- `chunk_id` (unindexed, acts like a row identifier)
- `title`, `body` (indexed)
---
### 2.5 `rag_vec_chunks` — vector index (sqlite3-vec)
**Purpose**
- Semantic similarity search over chunks.
- Each chunk has a vector embedding.
**Key columns**
- `embedding float[DIM]`: embedding vector (DIM must match your model)
- `chunk_id`: join key to `rag_chunks`
- Optional metadata columns:
- `doc_id`, `source_id`, `updated_at`
- These allow filtering and joining without extra lookups, which helps performance.
**Note**
- The ingester decides what text is embedded (chunk body alone, or “Title + Tags + Body chunk”).
---
### 2.6 Optional convenience objects
- `rag_chunk_view`: joins `rag_chunks` with `rag_documents` for debugging/inspection
- `rag_sync_state`: reserved for incremental sync later (not used in v0)
---
## 3. Table relationships (the graph)
Think of this as a data pipeline graph:
```text
rag_sources          (defines mapping + chunking + embedding)
      |
      v
rag_documents        (1 row per source row)
      |
      v
rag_chunks           (1..N chunks per document)
     /    \
    v      v
rag_fts  rag_vec
```
**Cardinality**
- `rag_sources (1) -> rag_documents (N)`
- `rag_documents (1) -> rag_chunks (N)`
- `rag_chunks (1) -> rag_fts_chunks (1)` (insertion done by ingester)
- `rag_chunks (1) -> rag_vec_chunks (0/1+)` (0 if embeddings disabled; 1 typically)
---
## 4. How mapping is defined (no guessing)
### 4.1 Why `doc_map_json` exists
A general-purpose system cannot infer that:
- `posts.Body` should become document body
- `posts.Title` should become title
- `Score`, `Tags`, `CreationDate`, etc. should become metadata
- Or how to concatenate fields
Therefore, `doc_map_json` is required.
### 4.2 `doc_map_json` structure (v0)
`doc_map_json` defines:
- `doc_id.format`: string template with `{ColumnName}` placeholders
- `title.concat`: concatenation spec
- `body.concat`: concatenation spec
- `metadata.pick`: list of column names to include in metadata JSON
- `metadata.rename`: mapping of old key -> new key (useful for typos or schema differences)
**Concatenation parts**
- `{"col":"Column"}` — appends the column value (if present)
- `{"lit":"..."}` — appends a literal string
Example (posts-like):
```json
{
  "doc_id": { "format": "posts:{Id}" },
  "title": { "concat": [ { "col": "Title" } ] },
  "body": { "concat": [ { "col": "Body" } ] },
  "metadata": {
    "pick": ["Id","PostTypeId","Tags","Score","CreaionDate"],
    "rename": {"CreaionDate":"CreationDate"}
  }
}
```
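The mapping rules above can be applied mechanically. Below is a minimal Python sketch of such an interpreter (illustrative only; the actual ingester blueprint is `rag_ingest.cpp`, and the helper names here are hypothetical):

```python
import json

def apply_concat(parts, row):
    """Concatenate {"col": ...} and {"lit": ...} parts into one string."""
    out = []
    for p in parts:
        if "col" in p:
            v = row.get(p["col"])
            if v is not None:
                out.append(str(v))
        elif "lit" in p:
            out.append(p["lit"])
    return "".join(out)

def map_row(doc_map, row):
    """Build doc_id, title, body, metadata_json from explicit mapping rules."""
    doc_id = doc_map["doc_id"]["format"].format(**row)
    title = apply_concat(doc_map["title"]["concat"], row)
    body = apply_concat(doc_map["body"]["concat"], row)
    # metadata.pick selects columns; metadata.rename fixes keys afterwards
    meta = {k: row[k] for k in doc_map["metadata"]["pick"] if k in row}
    for old, new in doc_map["metadata"].get("rename", {}).items():
        if old in meta:
            meta[new] = meta.pop(old)
    return doc_id, title, body, json.dumps(meta)
```

Note that nothing here guesses: every output field comes from an explicit rule in `doc_map_json`.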
---
## 5. Chunking strategy definition
### 5.1 Why chunking is configured per source
Different tables need different chunking:
- StackOverflow `Body` may be long -> chunking recommended
- Small “reference” tables may not need chunking at all
Thus chunking is stored in `rag_sources.chunking_json`.
### 5.2 `chunking_json` structure (v0)
v0 supports **chars-based** chunking (simple, robust).
```json
{
  "enabled": true,
  "unit": "chars",
  "chunk_size": 4000,
  "overlap": 400,
  "min_chunk_size": 800
}
```
**Behavior**
- If `body.length <= chunk_size` -> one chunk
- Else chunks of `chunk_size` with `overlap`
- Avoid tiny final chunks by appending the tail to the previous chunk if below `min_chunk_size`
**Why overlap matters**
- Prevents splitting a key sentence or code snippet across boundaries
- Improves both FTS and semantic retrieval consistency
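The chars-based behavior above (one chunk for short bodies, overlapping windows otherwise, tiny tails absorbed into the previous chunk) can be sketched as follows. This is an illustrative Python sketch, not the ingester's actual implementation:

```python
def chunk_text(body, chunk_size=4000, overlap=400, min_chunk_size=800):
    """Chars-based chunking: fixed windows with overlap; a tail shorter
    than min_chunk_size is absorbed into the final window."""
    if len(body) <= chunk_size:
        return [body]
    step = chunk_size - overlap
    chunks = []
    start = 0
    while True:
        end = start + chunk_size
        remaining = len(body) - end
        if 0 < remaining < min_chunk_size:
            end = len(body)  # avoid emitting a tiny final chunk
        chunks.append(body[start:end])
        if end >= len(body):
            return chunks
        start += step
```

For example, a 10,000-char body with the defaults yields three chunks, each overlapping its neighbor by 400 chars.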
---
## 6. Embedding strategy definition (where it fits in the model)
### 6.1 Why embeddings are per chunk
- Better retrieval precision
- Smaller context per match
- Allows partial updates later (only re-embed changed chunks)
### 6.2 `embedding_json` structure (v0)
```json
{
  "enabled": true,
  "dim": 1536,
  "model": "text-embedding-3-large",
  "input": { "concat": [
    {"col":"Title"},
    {"lit":"\nTags: "}, {"col":"Tags"},
    {"lit":"\n\n"},
    {"chunk_body": true}
  ]}
}
```
**Meaning**
- Build embedding input text from:
- title
- tags (as plain text)
- chunk body
This improves semantic retrieval for question-like content without embedding numeric metadata.
---
## 7. Ingestion lifecycle (step-by-step)
For each enabled `rag_sources` entry:
1. **Connect** to source DB using `backend_*`
2. **Select rows** from `table_name` (and optional `where_sql`)
- Select only needed columns determined by `doc_map_json` and `embedding_json`
3. For each row:
- Build `doc_id` using `doc_map_json.doc_id.format`
- Build `pk_json` from `pk_column`
- Build `title` using `title.concat`
- Build `body` using `body.concat`
- Build `metadata_json` using `metadata.pick` and `metadata.rename`
4. **Skip** if `doc_id` already exists (v0 behavior)
5. Insert into `rag_documents`
6. Chunk `body` using `chunking_json`
7. For each chunk:
- Insert into `rag_chunks`
- Insert into `rag_fts_chunks`
- If embeddings enabled:
- Build embedding input text using `embedding_json.input`
- Compute embedding
- Insert into `rag_vec_chunks`
8. Commit (ideally in a transaction for performance)
---
## 8. What changes later (incremental sync and updates)
v0 is “insert-only and skip-existing.”
Product-grade ingestion requires:
### 8.1 Detecting changes
Options:
- Watermark by `LastActivityDate` / `updated_at` column
- Hash (e.g. `sha256(title||body||metadata)`) stored in documents table
- Compare chunk hashes to re-embed only changed chunks
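The hash option above might look like this sketch (illustrative; the separator byte guards against concatenation ambiguity between fields):

```python
import hashlib

def doc_hash(title, body, metadata_json):
    """Stable content hash for update detection: recompute on re-ingest
    and compare with the stored value before rewriting the document."""
    h = hashlib.sha256()
    for part in (title, body, metadata_json):
        h.update(part.encode("utf-8"))
        h.update(b"\x00")  # separator so ("ab","c") hashes differently from ("a","bc")
    return "sha256:" + h.hexdigest()
```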
### 8.2 Updating and deleting
Needs:
- Upsert documents
- Delete or mark `deleted=1` when source row deleted
- Rebuild chunks and indexes when body changes
- Maintain FTS rows:
- delete old chunk rows from FTS
- insert updated chunk rows
### 8.3 Checkpoints
Use `rag_sync_state` to store:
- last ingested timestamp
- GTID/LSN for CDC
- or a monotonic PK watermark
The current schema already includes:
- `updated_at` and `deleted`
- `rag_sync_state` placeholder
So incremental sync can be added without breaking the data model.
---
## 9. Practical example: mapping `posts` table
Given a MySQL `posts` row:
- `Id = 12345`
- `Title = "How to parse JSON in MySQL 8?"`
- `Body = "<p>I tried JSON_EXTRACT...</p>"`
- `Tags = "<mysql><json>"`
- `Score = 12`
With mapping:
- `doc_id = "posts:12345"`
- `title = Title`
- `body = Body`
- `metadata_json` includes `{ "Tags": "...", "Score": "12", ... }`
- chunking splits body into:
- `posts:12345#0`, `posts:12345#1`, etc.
- FTS is populated with the chunk text
- vectors are stored per chunk
---
## 10. Summary
This data model separates concerns cleanly:
- `rag_sources` defines *policy* (what/how to ingest)
- `rag_documents` defines canonical *identity and refetch pointer*
- `rag_chunks` defines retrieval *units*
- `rag_fts_chunks` defines keyword search
- `rag_vec_chunks` defines semantic search
This separation makes the system:
- general purpose (works for many schemas)
- deterministic (no magic inference)
- extensible to incremental sync, external indexes, and richer hybrid retrieval

# ProxySQL RAG Engine — Runtime Retrieval Architecture (v0 Blueprint)
This document describes how ProxySQL becomes a **RAG retrieval engine** at runtime. The companion document (Data Model & Ingestion) explains how content enters the SQLite index. This document explains how content is **queried**, how results are **returned to agents/applications**, and how **hybrid retrieval** works in practice.
It is written as an implementation blueprint for ProxySQL (and its MCP server) and assumes the SQLite schema contains:
- `rag_sources` (control plane)
- `rag_documents` (canonical docs)
- `rag_chunks` (retrieval units)
- `rag_fts_chunks` (FTS5)
- `rag_vec_chunks` (sqlite3-vec vectors)
---
## 1. The runtime role of ProxySQL in a RAG system
ProxySQL becomes a RAG runtime by providing four capabilities in one bounded service:
1. **Retrieval Index Host**
- Hosts the SQLite index and search primitives (FTS + vectors).
- Offers deterministic query semantics and strict budgets.
2. **Orchestration Layer**
- Implements search flows (FTS, vector, hybrid, rerank).
- Applies filters, caps, and result shaping.
3. **Stable API Surface (MCP-first)**
- LLM agents call MCP tools (not raw SQL).
- Tool contracts remain stable even if internal storage changes.
4. **Authoritative Row Refetch Gateway**
- After retrieval returns `doc_id` / `pk_json`, ProxySQL can refetch the authoritative row from the source DB on-demand (optional).
- This avoids returning stale or partial data when the full row is needed.
In production terms, this is not “ProxySQL as a general search engine.” It is a **bounded retrieval service** colocated with database access logic.
---
## 2. High-level query flow (agent-centric)
A typical RAG flow has two phases:
### Phase A — Retrieval (fast, bounded, cheap)
- Query the index to obtain a small number of relevant chunks (and their parent doc identity).
- Output includes `chunk_id`, `doc_id`, `score`, and small metadata.
### Phase B — Fetch (optional, authoritative, bounded)
- If the agent needs full context or structured fields, it refetches the authoritative row from the source DB using `pk_json`.
- This avoids scanning large tables and avoids shipping huge payloads in Phase A.
**Canonical flow**
1. `rag.search_hybrid(query, filters, k)` → returns top chunk ids and scores
2. `rag.get_chunks(chunk_ids)` → returns chunk text for prompt grounding/citations
3. Optional: `rag.fetch_from_source(doc_id)` → returns full row or selected columns
---
## 3. Runtime interfaces: MCP vs SQL
ProxySQL should support two “consumption modes”:
### 3.1 MCP tools (preferred for AI agents)
- Strict limits and predictable response schemas.
- Tools return structured results and avoid SQL injection concerns.
- Agents do not need direct DB access.
### 3.2 SQL access (for standard applications / debugging)
- Applications may connect to ProxySQL's SQLite admin interface (or a dedicated port) and issue SQL.
- Useful for:
- internal dashboards
- troubleshooting
- non-agent apps that want retrieval but speak SQL
**Principle**
- MCP is the stable, long-term interface.
- SQL is optional and may be restricted to trusted callers.
---
## 4. Retrieval primitives
### 4.1 FTS retrieval (keyword / exact match)
FTS5 is used for:
- error messages
- identifiers and function names
- tags and exact terms
- “grep-like” queries
**Typical output**
- `chunk_id`, `score_fts`, optional highlights/snippets
**Ranking**
- `bm25(rag_fts_chunks)` is the default. It is fast and effective for term queries.
### 4.2 Vector retrieval (semantic similarity)
Vector search is used for:
- paraphrased questions
- semantic similarity (“how to do X” vs “best way to achieve X”)
- conceptual matching that is poor with keyword-only search
**Typical output**
- `chunk_id`, `score_vec` (distance/similarity), plus join metadata
**Important**
- Vectors are generally computed per chunk.
- Filters are applied via `source_id` and joins to `rag_chunks` / `rag_documents`.
---
## 5. Hybrid retrieval patterns (two recommended modes)
Hybrid retrieval combines FTS and vector search for better quality than either alone. Two concrete modes should be implemented because they solve different problems.
### Mode 1 — “Best of both” (parallel FTS + vector; fuse results)
**Use when**
- the query may contain both exact tokens (e.g. error messages) and semantic intent
**Flow**
1. Run FTS top-N (e.g. N=50)
2. Run vector top-N (e.g. N=50)
3. Merge results by `chunk_id`
4. Score fusion (recommended): Reciprocal Rank Fusion (RRF)
5. Return top-k (e.g. k=10)
**Why RRF**
- Robust without score calibration
- Works across heterogeneous score ranges (bm25 vs cosine distance)
**RRF formula**
- For each candidate chunk:
- `score = w_fts/(k0 + rank_fts) + w_vec/(k0 + rank_vec)`
- Typical: `k0=60`, `w_fts=1.0`, `w_vec=1.0`
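Applied to two ranked candidate lists, the formula above becomes a short fusion routine (illustrative Python sketch; ranks are 1-based, and a chunk missing from one list simply contributes no term for it):

```python
def rrf_fuse(fts_ids, vec_ids, k0=60, w_fts=1.0, w_vec=1.0, k=10):
    """Reciprocal Rank Fusion over two ranked lists of chunk_ids.
    Returns the top-k (chunk_id, fused_score) pairs, higher is better."""
    scores = {}
    for rank, cid in enumerate(fts_ids, start=1):
        scores[cid] = scores.get(cid, 0.0) + w_fts / (k0 + rank)
    for rank, cid in enumerate(vec_ids, start=1):
        scores[cid] = scores.get(cid, 0.0) + w_vec / (k0 + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
```

A chunk ranked by both FTS and vector search accumulates two terms, so agreement between the retrievers is rewarded without any score calibration.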
### Mode 2 — “Broad FTS then vector refine” (candidate generation + rerank)
**Use when**
- you want strong precision anchored to exact term matches
- you want to avoid vector search over the entire corpus
**Flow**
1. Run broad FTS query top-M (e.g. M=200)
2. Fetch chunk texts for those candidates
3. Compute vector similarity of query embedding to candidate embeddings
4. Return top-k
This mode behaves like a two-stage retrieval pipeline:
- Stage 1: cheap recall (FTS)
- Stage 2: precise semantic rerank within candidates
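Stage 2 of this mode reduces to scoring FTS candidates against the query embedding. A minimal sketch (illustrative Python; in practice the embeddings would be fetched from `rag_vec_chunks` rather than passed in):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rerank_candidates(query_vec, candidates, k=10):
    """Stage 2 rerank: candidates is a list of (chunk_id, embedding) pairs
    produced by the broad FTS stage."""
    scored = [(cid, cosine(query_vec, emb)) for cid, emb in candidates]
    scored.sort(key=lambda kv: kv[1], reverse=True)
    return scored[:k]
```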
---
## 6. Filters, constraints, and budgets (blast-radius control)
A RAG retrieval engine must be bounded. ProxySQL should enforce limits at the MCP layer and ideally also at SQL helper functions.
### 6.1 Hard caps (recommended defaults)
- Maximum `k` returned: 50
- Maximum candidates for broad-stage: 200-500
- Maximum query length: e.g. 2-8 KB
- Maximum response bytes: e.g. 1-5 MB
- Maximum execution time per request: e.g. 50-250 ms for retrieval, 1-2 s for fetch
### 6.2 Filter semantics
Filters should be applied consistently across retrieval modes.
Common filters:
- `source_id` or `source_name`
- tag include/exclude (via metadata_json parsing or pre-extracted tag fields later)
- post type (question vs answer)
- minimum score
- time range (creation date / last activity)
Implementation note:
- v0 stores metadata in JSON; filtering can be implemented in the MCP layer or via SQLite JSON functions (if enabled).
- For performance, later versions should denormalize key metadata into dedicated columns or side tables.
---
## 7. Result shaping and what the caller receives
A retrieval response must be designed for downstream LLM usage:
### 7.1 Retrieval results (Phase A)
Return a compact list of “evidence candidates”:
- `chunk_id`
- `doc_id`
- `scores` (fts, vec, fused)
- short `title`
- minimal metadata (source, tags, timestamp, etc.)
Do **not** return full bodies by default; that is what `rag.get_chunks` is for.
### 7.2 Chunk fetch results (Phase A.2)
`rag.get_chunks(chunk_ids)` returns:
- `chunk_id`, `doc_id`
- `title`
- `body` (chunk text)
- optionally a snippet/highlight for display
### 7.3 Source refetch results (Phase B)
`rag.fetch_from_source(doc_id)` returns:
- either the full row
- or a selected subset of columns (recommended)
This is the “authoritative fetch” boundary that prevents stale/partial index usage from being a correctness problem.
---
## 8. SQL examples (runtime extraction)
These are not the preferred agent interface, but they are crucial for debugging and for SQL-native apps.
### 8.1 FTS search (top 10)
```sql
SELECT
  f.chunk_id,
  bm25(rag_fts_chunks) AS score_fts
FROM rag_fts_chunks f
WHERE rag_fts_chunks MATCH 'json_extract mysql'
ORDER BY score_fts
LIMIT 10;
```
Join to fetch text:
```sql
SELECT
  f.chunk_id,
  bm25(rag_fts_chunks) AS score_fts,
  c.doc_id,
  c.body
FROM rag_fts_chunks f
JOIN rag_chunks c ON c.chunk_id = f.chunk_id
WHERE rag_fts_chunks MATCH 'json_extract mysql'
ORDER BY score_fts
LIMIT 10;
```
### 8.2 Vector search (top 10)
Vector syntax depends on how you expose query vectors. A typical pattern is:
1) Bind a query vector into a function / parameter
2) Use `rag_vec_chunks` to return nearest neighbors
Example shape (conceptual):
```sql
-- Pseudocode: nearest neighbors for :query_embedding
SELECT
  v.chunk_id,
  v.distance
FROM rag_vec_chunks v
WHERE v.embedding MATCH :query_embedding
ORDER BY v.distance
LIMIT 10;
```
In production, ProxySQL MCP will typically compute the query embedding and call SQL internally with a bound parameter.
---
## 9. MCP tools (runtime API surface)
This document does not define full schemas (that is in `mcp-tools.md`), but it defines what each tool must do.
### 9.1 Retrieval
- `rag.search_fts(query, filters, k)`
- `rag.search_vector(query_text | query_embedding, filters, k)`
- `rag.search_hybrid(query, mode, filters, k, params)`
- Mode 1: parallel + RRF fuse
- Mode 2: broad FTS candidates + vector rerank
### 9.2 Fetch
- `rag.get_chunks(chunk_ids)`
- `rag.get_docs(doc_ids)`
- `rag.fetch_from_source(doc_ids | pk_json, columns?, limits?)`
**MCP-first principle**
- Agents do not see SQLite schema or SQL.
- MCP tools remain stable even if you move index storage out of ProxySQL later.
---
## 10. Operational considerations
### 10.1 Dedicated ProxySQL instance
Run GenAI retrieval in a dedicated ProxySQL instance to reduce blast radius:
- independent CPU/memory budgets
- independent configuration and rate limits
- independent failure domain
### 10.2 Observability and metrics (minimum)
- count of docs/chunks per source
- query counts by tool and source
- p50/p95 latency for:
- FTS
- vector
- hybrid
- refetch
- dropped/limited requests (rate limit hit, cap exceeded)
- error rate and error categories
### 10.3 Safety controls
- strict upper bounds on `k` and candidate sizes
- strict timeouts
- response size caps
- optional allowlists for sources accessible to agents
- tenant boundaries via filters (strongly recommended for multi-tenant)
---
## 11. Recommended “v0-to-v1” evolution checklist
### v0 (PoC)
- ingestion to docs/chunks
- FTS search
- vector search (if embedding pipeline available)
- simple hybrid search
- chunk fetch
- manual/limited source refetch
### v1 (product hardening)
- incremental sync checkpoints (`rag_sync_state`)
- update detection (hashing/versioning)
- delete handling
- robust hybrid search:
- RRF fuse
- candidate-generation rerank
- stronger filtering semantics (denormalized metadata columns)
- quotas, rate limits, per-source budgets
- full MCP tool contracts + tests
---
## 12. Summary
At runtime, ProxySQL RAG retrieval is implemented as:
- **Index query** (FTS/vector/hybrid) returning a small set of chunk IDs
- **Chunk fetch** returning the text that the LLM will ground on
- Optional **authoritative refetch** from the source DB by primary key
- Strict limits and consistent filtering to keep the service bounded

# ProxySQL RAG Index — Embeddings & Vector Retrieval Design (Chunk-Level) (v0→v1 Blueprint)
This document specifies how embeddings should be produced, stored, updated, and queried for chunk-level vector search in ProxySQL's RAG index. It is intended as an implementation blueprint.
It assumes:
- Chunking is already implemented (`rag_chunks`).
- ProxySQL includes **sqlite3-vec** and uses a `vec0(...)` virtual table (`rag_vec_chunks`).
- Retrieval is exposed primarily via MCP tools (`mcp-tools.md`).
---
## 1. Design objectives
1. **Chunk-level embeddings**
- Each chunk receives its own embedding for retrieval precision.
2. **Deterministic embedding input**
- The text embedded is explicitly defined per source, not inferred.
3. **Model agility**
- The system can change embedding models/dimensions without breaking stored data or APIs.
4. **Efficient updates**
- Only recompute embeddings for chunks whose embedding input changed.
5. **Operational safety**
- Bound cost and latency (embedding generation can be expensive).
- Allow asynchronous embedding jobs if needed later.
---
## 2. What to embed (and what not to embed)
### 2.1 Embed text that improves semantic retrieval
Recommended embedding input per chunk:
- Document title (if present)
- Tags (as plain text)
- Chunk body
Example embedding input template:
```
{Title}
Tags: {Tags}
{ChunkBody}
```
This typically improves semantic recall significantly for knowledge-base-like content (StackOverflow posts, docs, tickets, runbooks).
### 2.2 Do NOT embed numeric metadata by default
Do not embed fields like `Score`, `ViewCount`, `OwnerUserId`, timestamps, etc. These should remain structured and be used for:
- filtering
- boosting
- tie-breaking
- result shaping
Embedding numeric metadata into text typically adds noise and reduces semantic quality.
### 2.3 Code and HTML considerations
If your chunk body contains HTML or code:
- **v0**: embed raw text (works, but may be noisy)
- **v1**: normalize to improve quality:
- strip HTML tags (keep text content)
- preserve code blocks as text, but consider stripping excessive markup
- optionally create specialized “code-only” chunks for code-heavy sources
Normalization should be source-configurable.
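A minimal sketch of the v1 normalization steps named above (illustrative Python; the regex-based tag stripping is deliberately crude and keeps only text content, which matches the v1 intent but not a full HTML parser):

```python
import re

def normalize_for_embedding(text, strip_html=True, collapse_whitespace=True):
    """v1-style normalization before embedding."""
    if strip_html:
        text = re.sub(r"<[^>]+>", " ", text)   # drop tags, keep text content
    if collapse_whitespace:
        text = re.sub(r"\s+", " ", text).strip()
    return text
```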
---
## 3. Where embedding input rules are defined
Embedding input rules must be explicit and stored per source.
### 3.1 `rag_sources.embedding_json`
Recommended schema:
```json
{
  "enabled": true,
  "model": "text-embedding-3-large",
  "dim": 1536,
  "input": {
    "concat": [
      {"col":"Title"},
      {"lit":"\nTags: "}, {"col":"Tags"},
      {"lit":"\n\n"},
      {"chunk_body": true}
    ]
  },
  "normalize": {
    "strip_html": true,
    "collapse_whitespace": true
  }
}
```
**Semantics**
- `enabled`: whether to compute/store embeddings for this source
- `model`: logical name (for observability and compatibility checks)
- `dim`: vector dimension
- `input.concat`: how to build embedding input text
- `normalize`: optional normalization steps
---
## 4. Storage schema and model/versioning
### 4.1 Current v0 schema: single vector table
`rag_vec_chunks` stores:
- embedding vector
- chunk_id
- doc_id/source_id convenience columns
- updated_at
This is appropriate for v0 when you assume a single embedding model/dimension.
### 4.2 Recommended v1 evolution: support multiple models
In a product setting, you may want multiple embedding models (e.g. general vs code-centric).
Two ways to support this:
#### Option A: include model identity columns in `rag_vec_chunks`
Add columns:
- `model TEXT`
- `dim INTEGER` (optional if fixed per model)
Then allow multiple rows per `chunk_id` (unique key becomes `(chunk_id, model)`).
This may require schema change and a different vec0 design (some vec0 configurations support metadata columns, but uniqueness must be handled carefully).
#### Option B: one vec table per model (recommended if vec0 constraints exist)
Create:
- `rag_vec_chunks_1536_v1`
- `rag_vec_chunks_1024_code_v1`
etc.
Then MCP tools select the table based on requested model or default configuration.
**Recommendation**
Start with Option A only if your sqlite3-vec build makes it easy to filter by model. Otherwise, Option B is operationally cleaner.
---
## 5. Embedding generation pipeline
### 5.1 When embeddings are created
Embeddings are created during ingestion, immediately after chunk creation, if `embedding_json.enabled=true`.
This provides a simple, synchronous pipeline:
- ingest row → create chunks → compute embedding → store vector
### 5.2 When embeddings should be updated
Embeddings must be recomputed if the *embedding input string* changes. That depends on:
- title changes
- tags changes
- chunk body changes
- normalization rules changes (strip_html etc.)
- embedding model changes
Therefore, update logic should be based on a **content hash** of the embedding input.
---
## 6. Content hashing for efficient updates (v1 recommendation)
### 6.1 Why hashing is needed
Without hashing, you might recompute embeddings unnecessarily:
- expensive
- slow
- prevents incremental sync from being efficient
### 6.2 Recommended approach
Store `embedding_input_hash` per chunk per model.
Implementation options:
#### Option A: Store hash in `rag_chunks.metadata_json`
Example:
```json
{
  "chunk_index": 0,
  "embedding_hash": "sha256:...",
  "embedding_model": "text-embedding-3-large"
}
```
Pros: no schema changes.
Cons: JSON parsing overhead.
#### Option B: Dedicated side table (recommended)
Create `rag_chunk_embedding_state`:
```sql
CREATE TABLE rag_chunk_embedding_state (
  chunk_id   TEXT NOT NULL,
  model      TEXT NOT NULL,
  dim        INTEGER NOT NULL,
  input_hash TEXT NOT NULL,
  updated_at INTEGER NOT NULL DEFAULT (unixepoch()),
  PRIMARY KEY(chunk_id, model)
);
```
Pros: fast lookups; avoids JSON parsing.
Cons: extra table.
**Recommendation**
Use Option B for v1.
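The hash-gated flow against that side table might look like the following sketch (illustrative Python; the table here is simplified to omit the `updated_at` default, and function names are hypothetical):

```python
import hashlib
import sqlite3

def needs_embedding(conn, chunk_id, model, input_text):
    """Return the new input hash if the chunk must be (re)embedded, else None."""
    h = "sha256:" + hashlib.sha256(input_text.encode("utf-8")).hexdigest()
    row = conn.execute(
        "SELECT input_hash FROM rag_chunk_embedding_state "
        "WHERE chunk_id=? AND model=?", (chunk_id, model)).fetchone()
    return None if row and row[0] == h else h

def record_embedding(conn, chunk_id, model, dim, input_hash):
    """Upsert the embedding state after a successful embedding call."""
    conn.execute(
        "INSERT INTO rag_chunk_embedding_state(chunk_id, model, dim, input_hash) "
        "VALUES(?,?,?,?) "
        "ON CONFLICT(chunk_id, model) DO UPDATE SET input_hash=excluded.input_hash",
        (chunk_id, model, dim, input_hash))
```

The ingester calls `needs_embedding` before invoking the embedding provider; an unchanged hash means the expensive call is skipped entirely.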
---
## 7. Embedding model integration options
### 7.1 External embedding service (recommended initially)
ProxySQL calls an embedding service:
- OpenAI-compatible endpoint, or
- local service (e.g. llama.cpp server), or
- vendor-specific embedding API
Pros:
- easy to iterate on model choice
- isolates ML runtime from ProxySQL process
Cons:
- network latency; requires caching and timeouts
### 7.2 Embedded model runtime inside ProxySQL
ProxySQL links to an embedding runtime (llama.cpp, etc.).
Pros:
- no network dependency
- predictable latency if tuned
Cons:
- increases memory footprint
- needs careful resource controls
**Recommendation**
Start with an external embedding provider and keep a modular interface that can be swapped later.
---
## 8. Query embedding generation
Vector search needs a query embedding. Do this in the MCP layer:
1. Take `query_text`
2. Apply query normalization (optional but recommended)
3. Compute query embedding using the same model used for chunks
4. Execute vector search SQL with a bound embedding vector
**Do not**
- accept arbitrary embedding vectors from untrusted callers without validation
- allow unbounded query lengths
---
## 9. Vector search semantics
### 9.1 Distance vs similarity
Depending on the embedding model and vec search primitive, vector search may return:
- cosine distance (lower is better)
- cosine similarity (higher is better)
- L2 distance (lower is better)
**Recommendation**
Normalize to a “higher is better” score in MCP responses:
- if distance: `score_vec = 1 / (1 + distance)` or similar monotonic transform
Keep raw distance in debug fields if needed.
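The suggested transform is a one-liner; the only property that matters is that it is monotonically decreasing in distance, so relative ordering is preserved (illustrative sketch):

```python
def to_score(distance):
    """Map a 'lower is better' distance onto a 'higher is better' score in (0, 1]."""
    return 1.0 / (1.0 + distance)
```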
### 9.2 Filtering
Filtering should be supported by:
- `source_id` restriction
- optional metadata filters (doc-level or chunk-level)
In v0, filtering by `source_id` is easiest because `rag_vec_chunks` stores `source_id` as metadata.
---
## 10. Hybrid retrieval integration
Embeddings are one leg of hybrid retrieval. Two recommended hybrid modes are described in `mcp-tools.md`:
1. **Fuse**: top-N FTS and top-N vector, merged by chunk_id, fused by RRF
2. **FTS then vector**: broad FTS candidates then vector rerank within candidates
Embeddings support both:
- Fuse mode needs global vector search top-N.
- Candidate mode needs vector search restricted to candidate chunk IDs.
Candidate mode is often cheaper and more precise when the query includes strong exact tokens.
---
## 11. Operational controls
### 11.1 Resource limits
Embedding generation must be bounded by:
- max chunk size embedded
- max chunks embedded per document
- per-source embedding rate limit
- timeouts when calling embedding provider
### 11.2 Batch embedding
To improve throughput, embed in batches:
- collect N chunks
- send embedding request for N inputs
- store results
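Batching reduces to grouping pending chunks before each provider call, e.g. (illustrative sketch):

```python
def batches(items, n):
    """Yield lists of up to n items, for batched embedding requests."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == n:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch
```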
### 11.3 Backpressure and async embedding
For v1, consider decoupling embedding generation from ingestion:
- ingestion stores chunks
- embedding worker processes “pending” chunks and fills vectors
This allows:
- ingestion to remain fast
- embedding to scale independently
- retries on embedding failures
In this design, store a state record:
- pending / ok / error
- last error message
- retry count
---
## 12. Recommended implementation steps (coding agent checklist)
### v0 (synchronous embedding)
1. Implement `embedding_json` parsing in ingester
2. Build embedding input string for each chunk
3. Call embedding provider (or use a stub in development)
4. Insert vector rows into `rag_vec_chunks`
5. Implement `rag.search_vector` MCP tool using query embedding + vector SQL
### v1 (efficient incremental embedding)
1. Add `rag_chunk_embedding_state` table
2. Store `input_hash` per chunk per model
3. Only re-embed if hash changed
4. Add async embedding worker option
5. Add metrics for embedding throughput and failures
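The hash-based update detection in steps 2-3 can be sketched as follows, assuming SHA-256 over the model name plus the exact embedding input; the in-memory `state` dict stands in for the proposed `rag_chunk_embedding_state` table.

```python
import hashlib

def embedding_input_hash(model: str, text: str) -> str:
    """Hash the exact embedding input together with the model name, so a
    model change also triggers re-embedding."""
    h = hashlib.sha256()
    h.update(model.encode("utf-8"))
    h.update(b"\x00")  # separator so (model, text) pairs cannot collide
    h.update(text.encode("utf-8"))
    return h.hexdigest()

def needs_embedding(state, chunk_id, model, text):
    """state: {(chunk_id, model): stored_hash}, mirroring a
    rag_chunk_embedding_state lookup. Returns (should_embed, new_hash)."""
    new_hash = embedding_input_hash(model, text)
    return state.get((chunk_id, model)) != new_hash, new_hash
```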
---
## 13. Summary
- Compute embeddings per chunk, not per document.
- Define embedding input explicitly in `rag_sources.embedding_json`.
- Store vectors in `rag_vec_chunks` (vec0).
- For production, add hash-based update detection and optional async embedding workers.
- Normalize vector scores in MCP responses and keep raw distance for debugging.


# MCP Tooling for ProxySQL RAG Engine (v0 Blueprint)
This document defines the MCP tool surface for querying ProxySQL's embedded RAG index. It is intended as a stable interface for AI agents. Internally, these tools query the SQLite schema described in `schema.sql` and the retrieval logic described in `architecture-runtime-retrieval.md`.
**Design goals**
- Stable tool contracts (do not break agents when internals change)
- Strict bounds (prevent unbounded scans / large outputs)
- Deterministic schemas (agents can reliably parse outputs)
- Separation of concerns:
- Retrieval returns identifiers and scores
- Fetch returns content
- Optional refetch returns authoritative source rows
---
## 1. Conventions
### 1.1 Identifiers
- `doc_id`: stable document identifier (e.g. `posts:12345`)
- `chunk_id`: stable chunk identifier (e.g. `posts:12345#0`)
- `source_id` / `source_name`: corresponds to `rag_sources`
### 1.2 Scores
- FTS score: `score_fts` (bm25; lower is better in SQLite's bm25 by default)
- Vector score: `score_vec` (distance or similarity, depending on implementation)
- Hybrid score: `score` (normalized fused score; higher is better)
**Recommendation**
Normalize scores in MCP layer so:
- higher is always better for agent ranking
- raw internal ranking can still be returned as `score_fts_raw`, `distance_raw`, etc. if helpful
### 1.3 Limits and budgets (recommended defaults)
All tools should enforce caps, regardless of caller input:
- `k_max = 50`
- `candidates_max = 500`
- `query_max_bytes = 8192`
- `response_max_bytes = 5_000_000`
- `timeout_ms` (per tool): 250–2000 ms depending on tool type
Tools must return a `truncated` boolean if limits reduce output.
---
## 2. Shared filter model
Many tools accept the same filter structure. This is intentionally simple in v0.
### 2.1 Filter object
```json
{
"source_ids": [1,2],
"source_names": ["stack_posts"],
"doc_ids": ["posts:12345"],
"min_score": 5,
"post_type_ids": [1],
"tags_any": ["mysql","json"],
"tags_all": ["mysql","json"],
"created_after": "2022-01-01T00:00:00Z",
"created_before": "2025-01-01T00:00:00Z"
}
```
**Notes**
- In v0, most filters map to `metadata_json` values. Implementation can:
- filter in SQLite if JSON functions are available, or
- filter in MCP layer after initial retrieval (acceptable for small k/candidates)
- For production, denormalize hot filters into dedicated columns for speed.
### 2.2 Filter behavior
- If both `source_ids` and `source_names` are provided, treat as intersection.
- If no source filter is provided, default to all enabled sources **but** enforce a strict global budget.
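The two rules above can be sketched as a source-resolution helper; the function and field names are illustrative, with `sources` standing in for rows from `rag_sources`.

```python
def resolve_sources(filters, sources):
    """sources: list of dicts like rows from rag_sources, e.g.
    {"source_id": 1, "name": "stack_posts", "enabled": 1}.

    Applies the v0 rules: intersect source_ids and source_names when
    both are given; default to all enabled sources when neither is."""
    enabled = [s for s in sources if s["enabled"]]
    ids = set(filters.get("source_ids") or [])
    names = set(filters.get("source_names") or [])
    result = enabled
    if ids:
        result = [s for s in result if s["source_id"] in ids]
    if names:
        result = [s for s in result if s["name"] in names]
    return sorted(s["source_id"] for s in result)
```

Callers resolving an empty result set should fail fast rather than silently search nothing.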
---
## 3. Tool: `rag.search_fts`
Keyword search over `rag_fts_chunks`.
### 3.1 Request schema
```json
{
"query": "json_extract mysql",
"k": 10,
"offset": 0,
"filters": { },
"return": {
"include_title": true,
"include_metadata": true,
"include_snippets": false
}
}
```
### 3.2 Semantics
- Executes FTS query (MATCH) over indexed content.
- Returns top-k chunk matches with scores and identifiers.
- Does not return full chunk bodies unless `include_snippets` is requested (still bounded).
### 3.3 Response schema
```json
{
"results": [
{
"chunk_id": "posts:12345#0",
"doc_id": "posts:12345",
"source_id": 1,
"source_name": "stack_posts",
"score_fts": 0.73,
"title": "How to parse JSON in MySQL 8?",
"metadata": { "Tags": "<mysql><json>", "Score": "12" }
}
],
"truncated": false,
"stats": {
"k_requested": 10,
"k_returned": 10,
"ms": 12
}
}
```
---
## 4. Tool: `rag.search_vector`
Semantic search over `rag_vec_chunks`.
### 4.1 Request schema (text input)
```json
{
"query_text": "How do I extract JSON fields in MySQL?",
"k": 10,
"filters": { },
"embedding": {
"model": "text-embedding-3-large"
}
}
```
### 4.2 Request schema (precomputed vector)
```json
{
"query_embedding": {
"dim": 1536,
"values_b64": "AAAA..." // float32 array packed and base64 encoded
},
"k": 10,
"filters": { }
}
```
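The `values_b64` convention above (a packed float32 array, base64 encoded) can be implemented with the standard library; `pack_embedding`/`unpack_embedding` are hypothetical helper names, and little-endian byte order is an assumption that both sides must agree on.

```python
import base64
import struct

def pack_embedding(values):
    """Pack a list of floats as little-endian float32 and base64-encode
    it, matching the values_b64 convention described above."""
    raw = struct.pack(f"<{len(values)}f", *values)
    return base64.b64encode(raw).decode("ascii")

def unpack_embedding(values_b64, dim):
    """Inverse of pack_embedding; validates the declared dimension."""
    raw = base64.b64decode(values_b64)
    if len(raw) != dim * 4:
        raise ValueError(f"expected {dim * 4} bytes, got {len(raw)}")
    return list(struct.unpack(f"<{dim}f", raw))
```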
### 4.3 Semantics
- If `query_text` is provided, ProxySQL computes embedding internally (preferred for agents).
- If `query_embedding` is provided, ProxySQL uses it directly (useful for advanced clients).
- Returns nearest chunks by distance/similarity.
### 4.4 Response schema
```json
{
"results": [
{
"chunk_id": "posts:9876#1",
"doc_id": "posts:9876",
"source_id": 1,
"source_name": "stack_posts",
"score_vec": 0.82,
"title": "Query JSON columns efficiently",
"metadata": { "Tags": "<mysql><json>", "Score": "8" }
}
],
"truncated": false,
"stats": {
"k_requested": 10,
"k_returned": 10,
"ms": 18
}
}
```
---
## 5. Tool: `rag.search_hybrid`
Hybrid search combining FTS and vectors. Supports two modes:
- **Mode A**: parallel FTS + vector, fuse results (RRF recommended)
- **Mode B**: broad FTS candidate generation, then vector rerank
### 5.1 Request schema (Mode A: fuse)
```json
{
"query": "json_extract mysql",
"k": 10,
"filters": { },
"mode": "fuse",
"fuse": {
"fts_k": 50,
"vec_k": 50,
"rrf_k0": 60,
"w_fts": 1.0,
"w_vec": 1.0
}
}
```
### 5.2 Request schema (Mode B: candidates + rerank)
```json
{
"query": "json_extract mysql",
"k": 10,
"filters": { },
"mode": "fts_then_vec",
"fts_then_vec": {
"candidates_k": 200,
"rerank_k": 50,
"vec_metric": "cosine"
}
}
```
### 5.3 Semantics (Mode A)
1. Run FTS top `fts_k`
2. Run vector top `vec_k`
3. Merge candidates by `chunk_id`
4. Compute fused score (RRF recommended)
5. Return top `k`
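Steps 3-4 (merge by `chunk_id`, fuse by RRF) can be sketched as follows, using the `rrf_k0`, `w_fts`, and `w_vec` parameters from the request schema; this is an illustration of the scoring, not the production implementation.

```python
def rrf_fuse(fts_ids, vec_ids, k, rrf_k0=60, w_fts=1.0, w_vec=1.0):
    """fts_ids / vec_ids: chunk_ids already ranked best-first by each
    retrieval leg. Returns the top-k chunk_ids after weighted
    Reciprocal Rank Fusion: score = sum(w / (rrf_k0 + rank))."""
    scores = {}
    for rank, cid in enumerate(fts_ids, start=1):
        scores[cid] = scores.get(cid, 0.0) + w_fts / (rrf_k0 + rank)
    for rank, cid in enumerate(vec_ids, start=1):
        scores[cid] = scores.get(cid, 0.0) + w_vec / (rrf_k0 + rank)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [cid for cid, _ in ranked[:k]]
```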
### 5.4 Semantics (Mode B)
1. Run FTS top `candidates_k`
2. Compute vector similarity within those candidates
- either by joining candidate chunk_ids to stored vectors, or
- by embedding candidate chunk text on the fly (not recommended)
3. Return top `k` reranked results
4. Optionally return debug info about candidate stages
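Step 2 (vector rerank restricted to candidates) can be sketched with a plain cosine similarity over stored vectors; in production these vectors come from joining candidate chunk_ids to `rag_vec_chunks`, and the helper here is illustrative only.

```python
def rerank_candidates(query_vec, candidate_vecs, rerank_k):
    """candidate_vecs: {chunk_id: vector} for the FTS candidate set
    (stored vectors looked up by chunk_id). Returns the top rerank_k
    chunk_ids by cosine similarity to query_vec."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    scored = sorted(
        ((cosine(query_vec, v), cid) for cid, v in candidate_vecs.items()),
        reverse=True,
    )
    return [cid for _, cid in scored[:rerank_k]]
```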
### 5.5 Response schema
```json
{
"results": [
{
"chunk_id": "posts:12345#0",
"doc_id": "posts:12345",
"source_id": 1,
"source_name": "stack_posts",
"score": 0.91,
"score_fts": 0.74,
"score_vec": 0.86,
"title": "How to parse JSON in MySQL 8?",
"metadata": { "Tags": "<mysql><json>", "Score": "12" },
"debug": {
"rank_fts": 3,
"rank_vec": 6
}
}
],
"truncated": false,
"stats": {
"mode": "fuse",
"k_requested": 10,
"k_returned": 10,
"ms": 27
}
}
```
---
## 6. Tool: `rag.get_chunks`
Fetch chunk bodies by chunk_id. This is how agents obtain grounding text.
### 6.1 Request schema
```json
{
"chunk_ids": ["posts:12345#0", "posts:9876#1"],
"return": {
"include_title": true,
"include_doc_metadata": true,
"include_chunk_metadata": true
}
}
```
### 6.2 Response schema
```json
{
"chunks": [
{
"chunk_id": "posts:12345#0",
"doc_id": "posts:12345",
"title": "How to parse JSON in MySQL 8?",
"body": "<p>I tried JSON_EXTRACT...</p>",
"doc_metadata": { "Tags": "<mysql><json>", "Score": "12" },
"chunk_metadata": { "chunk_index": 0 }
}
],
"truncated": false,
"stats": { "ms": 6 }
}
```
**Hard limit recommendation**
- Cap total returned chunk bytes to a safe maximum (e.g. 1–2 MB).
---
## 7. Tool: `rag.get_docs`
Fetch full canonical documents by doc_id (not chunks). Useful for inspection or compact docs.
### 7.1 Request schema
```json
{
"doc_ids": ["posts:12345"],
"return": {
"include_body": true,
"include_metadata": true
}
}
```
### 7.2 Response schema
```json
{
"docs": [
{
"doc_id": "posts:12345",
"source_id": 1,
"source_name": "stack_posts",
"pk_json": { "Id": 12345 },
"title": "How to parse JSON in MySQL 8?",
"body": "<p>...</p>",
"metadata": { "Tags": "<mysql><json>", "Score": "12" }
}
],
"truncated": false,
"stats": { "ms": 7 }
}
```
---
## 8. Tool: `rag.fetch_from_source`
Refetch authoritative rows from the source DB using `doc_id` (via pk_json).
### 8.1 Request schema
```json
{
"doc_ids": ["posts:12345"],
"columns": ["Id","Title","Body","Tags","Score"],
"limits": {
"max_rows": 10,
"max_bytes": 200000
}
}
```
### 8.2 Semantics
- Look up doc(s) in `rag_documents` to get `source_id` and `pk_json`
- Resolve source connection from `rag_sources`
- Execute a parameterized query by primary key
- Return requested columns only
- Enforce strict limits
### 8.3 Response schema
```json
{
"rows": [
{
"doc_id": "posts:12345",
"source_name": "stack_posts",
"row": {
"Id": 12345,
"Title": "How to parse JSON in MySQL 8?",
"Score": 12
}
}
],
"truncated": false,
"stats": { "ms": 22 }
}
```
**Security note**
- This tool must not allow arbitrary SQL.
- Only allow fetching by primary key and a whitelist of columns.
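A minimal sketch of that restriction: build only parameterized fetch-by-PK statements against a column whitelist. The helper name is hypothetical; in the real implementation `table` and `pk_column` would themselves come from trusted `rag_sources` config, never from the caller.

```python
def build_fetch_query(table, pk_column, requested_columns, allowed_columns):
    """Build a parameterized fetch-by-primary-key query, rejecting any
    column not in the whitelist. The PK value is always bound as a
    parameter, never interpolated into the SQL text."""
    bad = [c for c in requested_columns if c not in allowed_columns]
    if bad:
        raise ValueError(f"columns not allowed: {bad}")
    cols = ", ".join(f"`{c}`" for c in requested_columns)
    return f"SELECT {cols} FROM `{table}` WHERE `{pk_column}` = ?"
```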
---
## 9. Tool: `rag.admin.stats` (recommended)
Operational visibility for dashboards and debugging.
### 9.1 Request
```json
{}
```
### 9.2 Response
```json
{
"sources": [
{
"source_id": 1,
"source_name": "stack_posts",
"docs": 123456,
"chunks": 456789,
"last_sync": null
}
],
"stats": { "ms": 5 }
}
```
---
## 10. Tool: `rag.admin.sync` (optional in v0; required in v1)
Kicks off ingestion for a source or for all sources. In v0, ingestion may run as a separate process; in ProxySQL product form, this would trigger an internal job.
### 10.1 Request
```json
{
"source_names": ["stack_posts"]
}
```
### 10.2 Response
```json
{
"accepted": true,
"job_id": "sync-2026-01-19T10:00:00Z"
}
```
---
## 11. Implementation notes (what the coding agent should implement)
1. **Input validation and caps** for every tool.
2. **Consistent filtering** across FTS/vector/hybrid.
3. **Stable scoring semantics** (higher-is-better recommended).
4. **Efficient joins**:
- vector search returns chunk_ids; join to `rag_chunks`/`rag_documents` for metadata.
5. **Hybrid modes**:
- Mode A (fuse): implement RRF
- Mode B (fts_then_vec): candidate set then vector rerank
6. **Error model**:
- return structured errors with codes (e.g. `INVALID_ARGUMENT`, `LIMIT_EXCEEDED`, `INTERNAL`)
7. **Observability**:
- return `stats.ms` in responses
- track tool usage counters and latency histograms
---
## 12. Summary
These MCP tools define a stable retrieval interface:
- Search: `rag.search_fts`, `rag.search_vector`, `rag.search_hybrid`
- Fetch: `rag.get_chunks`, `rag.get_docs`, `rag.fetch_from_source`
- Admin: `rag.admin.stats`, optionally `rag.admin.sync`

-- ============================================================
-- ProxySQL RAG Index Schema (SQLite)
-- v0: documents + chunks + FTS5 + sqlite3-vec embeddings
-- ============================================================
PRAGMA foreign_keys = ON;
PRAGMA journal_mode = WAL;
PRAGMA synchronous = NORMAL;
-- ============================================================
-- 1) rag_sources: control plane
-- Defines where to fetch from + how to transform + chunking.
-- ============================================================
CREATE TABLE IF NOT EXISTS rag_sources (
source_id INTEGER PRIMARY KEY,
name TEXT NOT NULL UNIQUE, -- e.g. "stack_posts"
enabled INTEGER NOT NULL DEFAULT 1,
-- Where to retrieve from (PoC: connect directly; later can be "via ProxySQL")
backend_type TEXT NOT NULL, -- "mysql" | "postgres" | ...
backend_host TEXT NOT NULL,
backend_port INTEGER NOT NULL,
backend_user TEXT NOT NULL,
backend_pass TEXT NOT NULL,
backend_db TEXT NOT NULL, -- database/schema name
table_name TEXT NOT NULL, -- e.g. "posts"
pk_column TEXT NOT NULL, -- e.g. "Id"
-- Optional: restrict ingestion; appended to SELECT as WHERE <where_sql>
where_sql TEXT, -- e.g. "PostTypeId IN (1,2)"
-- REQUIRED: mapping from source row -> rag_documents fields
-- JSON spec describing doc_id, title/body concat, metadata pick/rename, etc.
doc_map_json TEXT NOT NULL,
-- REQUIRED: chunking strategy (enabled, chunk_size, overlap, etc.)
chunking_json TEXT NOT NULL,
-- Optional: embedding strategy (how to build embedding input text)
-- In v0 you can keep it NULL/empty; define later without schema changes.
embedding_json TEXT,
created_at INTEGER NOT NULL DEFAULT (unixepoch()),
updated_at INTEGER NOT NULL DEFAULT (unixepoch())
);
CREATE INDEX IF NOT EXISTS idx_rag_sources_enabled
ON rag_sources(enabled);
CREATE INDEX IF NOT EXISTS idx_rag_sources_backend
ON rag_sources(backend_type, backend_host, backend_port, backend_db, table_name);
-- ============================================================
-- 2) rag_documents: canonical documents
-- One document per source row (e.g. one per posts.Id).
-- ============================================================
CREATE TABLE IF NOT EXISTS rag_documents (
doc_id TEXT PRIMARY KEY, -- stable: e.g. "posts:12345"
source_id INTEGER NOT NULL REFERENCES rag_sources(source_id),
source_name TEXT NOT NULL, -- copy of rag_sources.name for convenience
pk_json TEXT NOT NULL, -- e.g. {"Id":12345}
title TEXT,
body TEXT,
metadata_json TEXT NOT NULL DEFAULT '{}', -- JSON object
updated_at INTEGER NOT NULL DEFAULT (unixepoch()),
deleted INTEGER NOT NULL DEFAULT 0
);
CREATE INDEX IF NOT EXISTS idx_rag_documents_source_updated
ON rag_documents(source_id, updated_at);
CREATE INDEX IF NOT EXISTS idx_rag_documents_source_deleted
ON rag_documents(source_id, deleted);
-- ============================================================
-- 3) rag_chunks: chunked content
-- The unit we index in FTS and vectors.
-- ============================================================
CREATE TABLE IF NOT EXISTS rag_chunks (
chunk_id TEXT PRIMARY KEY, -- e.g. "posts:12345#0"
doc_id TEXT NOT NULL REFERENCES rag_documents(doc_id),
source_id INTEGER NOT NULL REFERENCES rag_sources(source_id),
chunk_index INTEGER NOT NULL, -- 0..N-1
title TEXT,
body TEXT NOT NULL,
-- Optional per-chunk metadata (e.g. offsets, has_code, section label)
metadata_json TEXT NOT NULL DEFAULT '{}',
updated_at INTEGER NOT NULL DEFAULT (unixepoch()),
deleted INTEGER NOT NULL DEFAULT 0
);
CREATE UNIQUE INDEX IF NOT EXISTS uq_rag_chunks_doc_idx
ON rag_chunks(doc_id, chunk_index);
CREATE INDEX IF NOT EXISTS idx_rag_chunks_source_doc
ON rag_chunks(source_id, doc_id);
CREATE INDEX IF NOT EXISTS idx_rag_chunks_deleted
ON rag_chunks(deleted);
-- ============================================================
-- 4) rag_fts_chunks: FTS5 index (contentless)
-- Maintained explicitly by the ingester.
-- Notes:
-- - chunk_id is stored but UNINDEXED.
-- - Use bm25(rag_fts_chunks) for ranking.
-- ============================================================
CREATE VIRTUAL TABLE IF NOT EXISTS rag_fts_chunks
USING fts5(
chunk_id UNINDEXED,
title,
body,
tokenize = 'unicode61'
);
-- ============================================================
-- 5) rag_vec_chunks: sqlite3-vec index
-- Stores embeddings per chunk for vector search.
--
-- IMPORTANT:
-- - dimension must match your embedding model (example: 1536).
-- - metadata columns are included to help join/filter.
-- ============================================================
CREATE VIRTUAL TABLE IF NOT EXISTS rag_vec_chunks
USING vec0(
embedding float[1536], -- change if you use another dimension
chunk_id TEXT, -- join key back to rag_chunks
doc_id TEXT, -- optional convenience
source_id INTEGER, -- optional convenience
updated_at INTEGER -- optional convenience
);
-- Optional: convenience view for debugging / SQL access patterns
CREATE VIEW IF NOT EXISTS rag_chunk_view AS
SELECT
c.chunk_id,
c.doc_id,
c.source_id,
d.source_name,
d.pk_json,
COALESCE(c.title, d.title) AS title,
c.body,
d.metadata_json AS doc_metadata_json,
c.metadata_json AS chunk_metadata_json,
c.updated_at
FROM rag_chunks c
JOIN rag_documents d ON d.doc_id = c.doc_id
WHERE c.deleted = 0 AND d.deleted = 0;
-- ============================================================
-- 6) (Optional) sync state placeholder for later incremental ingestion
-- Not used in v0, but reserving it avoids later schema churn.
-- ============================================================
CREATE TABLE IF NOT EXISTS rag_sync_state (
source_id INTEGER PRIMARY KEY REFERENCES rag_sources(source_id),
mode TEXT NOT NULL DEFAULT 'poll', -- 'poll' | 'cdc'
cursor_json TEXT NOT NULL DEFAULT '{}', -- watermark/checkpoint
last_ok_at INTEGER,
last_error TEXT
);

# ProxySQL RAG Index — SQL Examples (FTS, Vectors, Hybrid)
This file provides concrete SQL examples for querying the ProxySQL-hosted SQLite RAG index directly (for debugging, internal dashboards, or SQL-native applications).
The **preferred interface for AI agents** remains MCP tools (`mcp-tools.md`). SQL access should typically be restricted to trusted callers.
Assumed tables:
- `rag_documents`
- `rag_chunks`
- `rag_fts_chunks` (FTS5)
- `rag_vec_chunks` (sqlite3-vec vec0 table)
---
## 0. Common joins and inspection
### 0.1 Inspect one document and its chunks
```sql
SELECT * FROM rag_documents WHERE doc_id = 'posts:12345';
SELECT * FROM rag_chunks WHERE doc_id = 'posts:12345' ORDER BY chunk_index;
```
### 0.2 Use the convenience view (if enabled)
```sql
SELECT * FROM rag_chunk_view WHERE doc_id = 'posts:12345' ORDER BY chunk_id;
```
---
## 1. FTS5 examples
### 1.1 Basic FTS search (top 10)
```sql
SELECT
f.chunk_id,
bm25(rag_fts_chunks) AS score_fts_raw
FROM rag_fts_chunks f
WHERE rag_fts_chunks MATCH 'json_extract mysql'
ORDER BY score_fts_raw
LIMIT 10;
```
### 1.2 Join FTS results to chunk text and document metadata
```sql
SELECT
f.chunk_id,
bm25(rag_fts_chunks) AS score_fts_raw,
c.doc_id,
COALESCE(c.title, d.title) AS title,
c.body AS chunk_body,
d.metadata_json AS doc_metadata_json
FROM rag_fts_chunks f
JOIN rag_chunks c ON c.chunk_id = f.chunk_id
JOIN rag_documents d ON d.doc_id = c.doc_id
WHERE rag_fts_chunks MATCH 'json_extract mysql'
AND c.deleted = 0 AND d.deleted = 0
ORDER BY score_fts_raw
LIMIT 10;
```
### 1.3 Apply a source filter (by source_id)
```sql
SELECT
f.chunk_id,
bm25(rag_fts_chunks) AS score_fts_raw
FROM rag_fts_chunks f
JOIN rag_chunks c ON c.chunk_id = f.chunk_id
WHERE rag_fts_chunks MATCH 'replication lag'
AND c.source_id = 1
ORDER BY score_fts_raw
LIMIT 20;
```
### 1.4 Phrase queries, boolean operators (FTS5)
```sql
-- phrase
SELECT chunk_id FROM rag_fts_chunks
WHERE rag_fts_chunks MATCH '"group replication"'
LIMIT 20;
-- boolean: term1 AND term2
SELECT chunk_id FROM rag_fts_chunks
WHERE rag_fts_chunks MATCH 'mysql AND deadlock'
LIMIT 20;
-- boolean: term1 NOT term2
SELECT chunk_id FROM rag_fts_chunks
WHERE rag_fts_chunks MATCH 'mysql NOT mariadb'
LIMIT 20;
```
---
## 2. Vector search examples (sqlite3-vec)
Vector SQL varies slightly depending on sqlite3-vec build and how you bind vectors.
Below are **two patterns** you can implement in ProxySQL.
### 2.1 Pattern A (recommended): ProxySQL computes embeddings; SQL receives a bound vector
In this pattern, ProxySQL:
1) Computes the query embedding in C++
2) Executes SQL with a bound parameter `:qvec` representing the embedding
A typical “nearest neighbors” query shape is:
```sql
-- PSEUDOCODE: adapt to sqlite3-vec's exact operator/function in your build.
SELECT
v.chunk_id,
v.distance AS distance_raw
FROM rag_vec_chunks v
WHERE v.embedding MATCH :qvec
ORDER BY distance_raw
LIMIT 10;
```
Then join to chunks:
```sql
-- PSEUDOCODE: join with content and metadata
SELECT
v.chunk_id,
v.distance AS distance_raw,
c.doc_id,
c.body AS chunk_body,
d.metadata_json AS doc_metadata_json
FROM (
SELECT chunk_id, distance
FROM rag_vec_chunks
WHERE embedding MATCH :qvec
ORDER BY distance
LIMIT 10
) v
JOIN rag_chunks c ON c.chunk_id = v.chunk_id
JOIN rag_documents d ON d.doc_id = c.doc_id;
```
### 2.2 Pattern B (debug): store a query vector in a temporary table
This is useful when you want to run vector queries manually in SQL without MCP support.
```sql
CREATE TEMP TABLE tmp_query_vec(qvec BLOB);
-- Insert the query vector (float32 array blob). The insertion is usually done by tooling, not manually.
-- INSERT INTO tmp_query_vec VALUES (X'...');
-- PSEUDOCODE: use tmp_query_vec.qvec as the query embedding
SELECT
v.chunk_id,
v.distance
FROM rag_vec_chunks v, tmp_query_vec t
WHERE v.embedding MATCH t.qvec
ORDER BY v.distance
LIMIT 10;
```
---
## 3. Hybrid search examples
Hybrid retrieval is best implemented in the MCP layer because it mixes ranking systems and needs careful bounding.
However, you can approximate hybrid behavior using SQL to validate logic.
### 3.1 Hybrid Mode A: Parallel FTS + Vector then fuse (RRF)
#### Step 1: FTS top 50 (ranked)
```sql
WITH fts AS (
SELECT
f.chunk_id,
bm25(rag_fts_chunks) AS score_fts_raw
FROM rag_fts_chunks f
WHERE rag_fts_chunks MATCH :fts_query
ORDER BY score_fts_raw
LIMIT 50
)
SELECT * FROM fts;
```
#### Step 2: Vector top 50 (ranked)
```sql
WITH vec AS (
SELECT
v.chunk_id,
v.distance AS distance_raw
FROM rag_vec_chunks v
WHERE v.embedding MATCH :qvec
ORDER BY v.distance
LIMIT 50
)
SELECT * FROM vec;
```
#### Step 3: Fuse via Reciprocal Rank Fusion (RRF)
In SQL you need ranks. SQLite supports window functions in modern builds.
```sql
WITH
fts AS (
SELECT
f.chunk_id,
bm25(rag_fts_chunks) AS score_fts_raw,
ROW_NUMBER() OVER (ORDER BY bm25(rag_fts_chunks)) AS rank_fts
FROM rag_fts_chunks f
WHERE rag_fts_chunks MATCH :fts_query
ORDER BY score_fts_raw
LIMIT 50
),
vec AS (
SELECT
v.chunk_id,
v.distance AS distance_raw,
ROW_NUMBER() OVER (ORDER BY v.distance) AS rank_vec
FROM rag_vec_chunks v
WHERE v.embedding MATCH :qvec
ORDER BY v.distance
LIMIT 50
),
merged AS (
SELECT
COALESCE(fts.chunk_id, vec.chunk_id) AS chunk_id,
fts.rank_fts,
vec.rank_vec,
fts.score_fts_raw,
vec.distance_raw
FROM fts
FULL OUTER JOIN vec ON vec.chunk_id = fts.chunk_id
),
rrf AS (
SELECT
chunk_id,
score_fts_raw,
distance_raw,
rank_fts,
rank_vec,
(1.0 / (60.0 + COALESCE(rank_fts, 1000000))) +
(1.0 / (60.0 + COALESCE(rank_vec, 1000000))) AS score_rrf
FROM merged
)
SELECT
r.chunk_id,
r.score_rrf,
c.doc_id,
c.body AS chunk_body
FROM rrf r
JOIN rag_chunks c ON c.chunk_id = r.chunk_id
ORDER BY r.score_rrf DESC
LIMIT 10;
```
**Important**: SQLite supports `FULL OUTER JOIN` only since version 3.39.0, so this query will not run on older builds.
For production, implement the merge/fuse in C++ (MCP layer). This SQL is illustrative.
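The application-side merge that replaces the `FULL OUTER JOIN` is straightforward; the production version would live in the C++ MCP layer, and this Python sketch only illustrates the shape of it.

```python
def full_outer_merge(fts_rows, vec_rows):
    """Emulate the FULL OUTER JOIN of the two ranked CTEs in
    application code.

    fts_rows: [(chunk_id, rank_fts)]; vec_rows: [(chunk_id, rank_vec)].
    Returns {chunk_id: (rank_fts or None, rank_vec or None)}, ready for
    RRF scoring with COALESCE-style defaults for missing ranks."""
    merged = {cid: (rank, None) for cid, rank in fts_rows}
    for cid, rank in vec_rows:
        fts_rank = merged[cid][0] if cid in merged else None
        merged[cid] = (fts_rank, rank)
    return merged
```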
### 3.2 Hybrid Mode B: Broad FTS then vector rerank (candidate generation)
#### Step 1: FTS candidate set (top 200)
```sql
WITH candidates AS (
SELECT
f.chunk_id,
bm25(rag_fts_chunks) AS score_fts_raw
FROM rag_fts_chunks f
WHERE rag_fts_chunks MATCH :fts_query
ORDER BY score_fts_raw
LIMIT 200
)
SELECT * FROM candidates;
```
#### Step 2: Vector rerank within candidates
Conceptually:
- Join candidates to `rag_vec_chunks` and compute distance to `:qvec`
- Keep top 10
```sql
WITH candidates AS (
SELECT
f.chunk_id
FROM rag_fts_chunks f
WHERE rag_fts_chunks MATCH :fts_query
ORDER BY bm25(rag_fts_chunks)
LIMIT 200
),
reranked AS (
SELECT
v.chunk_id,
v.distance AS distance_raw
FROM rag_vec_chunks v
JOIN candidates c ON c.chunk_id = v.chunk_id
WHERE v.embedding MATCH :qvec
ORDER BY v.distance
LIMIT 10
)
SELECT
r.chunk_id,
r.distance_raw,
ch.doc_id,
ch.body
FROM reranked r
JOIN rag_chunks ch ON ch.chunk_id = r.chunk_id;
```
As above, the exact `MATCH :qvec` syntax may need adaptation to your sqlite3-vec build; implement vector query execution in C++ and keep SQL as internal glue.
---
## 4. Common “application-friendly” queries
### 4.1 Return doc_id + score + title only (no bodies)
```sql
SELECT
f.chunk_id,
c.doc_id,
COALESCE(c.title, d.title) AS title,
bm25(rag_fts_chunks) AS score_fts_raw
FROM rag_fts_chunks f
JOIN rag_chunks c ON c.chunk_id = f.chunk_id
JOIN rag_documents d ON d.doc_id = c.doc_id
WHERE rag_fts_chunks MATCH :q
ORDER BY score_fts_raw
LIMIT 20;
```
### 4.2 Return top doc_ids (deduplicate by doc_id)
```sql
WITH ranked_chunks AS (
SELECT
c.doc_id,
bm25(rag_fts_chunks) AS score_fts_raw
FROM rag_fts_chunks f
JOIN rag_chunks c ON c.chunk_id = f.chunk_id
WHERE rag_fts_chunks MATCH :q
ORDER BY score_fts_raw
LIMIT 200
)
SELECT doc_id, MIN(score_fts_raw) AS best_score
FROM ranked_chunks
GROUP BY doc_id
ORDER BY best_score
LIMIT 20;
```
---
## 5. Practical guidance
- Use SQL mode mainly for debugging and internal tooling.
- Prefer MCP tools for agent interaction:
- stable schemas
- strong guardrails
- consistent hybrid scoring
- Implement hybrid fusion in C++ (not in SQL) to avoid dialect limitations and to keep scoring correct.