# ProxySQL RAG Index — Embeddings & Vector Retrieval Design (Chunk-Level) (v0→v1 Blueprint)

This document specifies how embeddings should be produced, stored, updated, and queried for chunk-level vector search in ProxySQL’s RAG index. It is intended as an implementation blueprint.

It assumes:

- Chunking is already implemented (`rag_chunks`).
- ProxySQL includes **sqlite3-vec** and uses a `vec0(...)` virtual table (`rag_vec_chunks`).
- Retrieval is exposed primarily via MCP tools (`mcp-tools.md`).

---

## 1. Design objectives

1. **Chunk-level embeddings**
   - Each chunk receives its own embedding for retrieval precision.
2. **Deterministic embedding input**
   - The text embedded is explicitly defined per source, not inferred.
3. **Model agility**
   - The system can change embedding models/dimensions without breaking stored data or APIs.
4. **Efficient updates**
   - Only recompute embeddings for chunks whose embedding input changed.
5. **Operational safety**
   - Bound cost and latency (embedding generation can be expensive).
   - Allow asynchronous embedding jobs if needed later.

---

## 2. What to embed (and what not to embed)

### 2.1 Embed text that improves semantic retrieval

Recommended embedding input per chunk:

- Document title (if present)
- Tags (as plain text)
- Chunk body

Example embedding input template:

```
{Title}
Tags: {Tags}

{ChunkBody}
```

This typically improves semantic recall significantly for knowledge-base-like content (StackOverflow posts, docs, tickets, runbooks).

### 2.2 Do NOT embed numeric metadata by default

Do not embed fields like `Score`, `ViewCount`, `OwnerUserId`, timestamps, etc. These should remain structured and be used for:

- filtering
- boosting
- tie-breaking
- result shaping

Embedding numeric metadata into text typically adds noise and reduces semantic quality.

### 2.3 Code and HTML considerations

If your chunk body contains HTML or code:

- **v0**: embed the raw text (works, but may be noisy)
- **v1**: normalize to improve quality:
  - strip HTML tags (keep text content)
  - preserve code blocks as text, but consider stripping excessive markup
  - optionally create specialized “code-only” chunks for code-heavy sources

Normalization should be source-configurable.

---

## 3. Where embedding input rules are defined

Embedding input rules must be explicit and stored per source.

### 3.1 `rag_sources.embedding_json`

Recommended schema:

```json
{
  "enabled": true,
  "model": "text-embedding-3-large",
  "dim": 1536,
  "input": {
    "concat": [
      {"col": "Title"},
      {"lit": "\nTags: "},
      {"col": "Tags"},
      {"lit": "\n\n"},
      {"chunk_body": true}
    ]
  },
  "normalize": {
    "strip_html": true,
    "collapse_whitespace": true
  }
}
```

**Semantics**

- `enabled`: whether to compute/store embeddings for this source
- `model`: logical model name (for observability and compatibility checks)
- `dim`: vector dimension
- `input.concat`: how to build the embedding input text
- `normalize`: optional normalization steps

---

## 4. Storage schema and model/versioning

### 4.1 Current v0 schema: single vector table

`rag_vec_chunks` stores:

- the embedding vector
- chunk_id
- doc_id/source_id convenience columns
- updated_at

This is appropriate for v0, where a single embedding model/dimension is assumed.

### 4.2 Recommended v1 evolution: support multiple models

In a product setting, you may want multiple embedding models (e.g. general vs code-centric). Two ways to support this:

#### Option A: include model identity columns in `rag_vec_chunks`

Add columns:

- `model TEXT`
- `dim INTEGER` (optional if fixed per model)

Then allow multiple rows per `chunk_id` (the unique key becomes `(chunk_id, model)`). This may require a schema change and a different vec0 design (some vec0 configurations support metadata columns, but uniqueness must be handled carefully).

#### Option B: one vec table per model (recommended if vec0 constraints exist)

Create:

- `rag_vec_chunks_1536_v1`
- `rag_vec_chunks_1024_code_v1`
- etc.

MCP tools then select the table based on the requested model or the default configuration.

**Recommendation**

Start with Option A only if your sqlite3-vec build makes it easy to filter by model. Otherwise, Option B is operationally cleaner.

---

## 5. Embedding generation pipeline

### 5.1 When embeddings are created

Embeddings are created during ingestion, immediately after chunk creation, if `embedding_json.enabled=true`. This gives a simple, synchronous pipeline:

- ingest row → create chunks → compute embedding → store vector

### 5.2 When embeddings should be updated

Embeddings must be recomputed whenever the *embedding input string* changes. That depends on:

- title changes
- tag changes
- chunk body changes
- normalization rule changes (`strip_html`, etc.)
- embedding model changes

Therefore, update logic should be based on a **content hash** of the embedding input.

---

## 6. Content hashing for efficient updates (v1 recommendation)

### 6.1 Why hashing is needed

Without hashing, you might recompute embeddings unnecessarily, which is:

- expensive
- slow
- a barrier to efficient incremental sync

### 6.2 Recommended approach

Store `embedding_input_hash` per chunk per model.

Implementation options:

#### Option A: store the hash in `rag_chunks.metadata_json`

Example:

```json
{
  "chunk_index": 0,
  "embedding_hash": "sha256:...",
  "embedding_model": "text-embedding-3-large"
}
```

Pros: no schema changes.
Cons: JSON parsing overhead.

#### Option B: dedicated side table (recommended)

Create `rag_chunk_embedding_state`:

```sql
CREATE TABLE rag_chunk_embedding_state (
  chunk_id TEXT NOT NULL,
  model TEXT NOT NULL,
  dim INTEGER NOT NULL,
  input_hash TEXT NOT NULL,
  updated_at INTEGER NOT NULL DEFAULT (unixepoch()),
  PRIMARY KEY(chunk_id, model)
);
```

Pros: fast lookups; avoids JSON parsing.
Cons: extra table.

**Recommendation**

Use Option B for v1.
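To make the update rule concrete, here is a minimal sketch of how an ingester could use this table, assuming it has already built the embedding input string for a chunk and computed a hash over it (the `:chunk_id`, `:model`, `:dim`, and `:input_hash` parameter names are illustrative):

```sql
-- Step 1: skip work when the stored hash matches the freshly computed one.
SELECT 1
FROM rag_chunk_embedding_state
WHERE chunk_id = :chunk_id
  AND model = :model
  AND input_hash = :input_hash;

-- Step 2: if no row came back, (re-)embed the chunk, replace its row in the
-- vector table, and record the new state via an upsert.
INSERT INTO rag_chunk_embedding_state (chunk_id, model, dim, input_hash, updated_at)
VALUES (:chunk_id, :model, :dim, :input_hash, unixepoch())
ON CONFLICT(chunk_id, model) DO UPDATE SET
  dim        = excluded.dim,
  input_hash = excluded.input_hash,
  updated_at = excluded.updated_at;
```

Performing the vector write and the state upsert in a single transaction keeps the two tables consistent if ingestion is interrupted.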
---

## 7. Embedding model integration options

### 7.1 External embedding service (recommended initially)

ProxySQL calls an embedding service:

- an OpenAI-compatible endpoint, or
- a local service (e.g. a llama.cpp server), or
- a vendor-specific embedding API

Pros:

- easy to iterate on model choice
- isolates the ML runtime from the ProxySQL process

Cons:

- network latency; requires caching and timeouts

### 7.2 Embedded model runtime inside ProxySQL

ProxySQL links an embedding runtime (llama.cpp, etc.) into its own process.

Pros:

- no network dependency
- predictable latency if tuned

Cons:

- increases memory footprint
- needs careful resource controls

**Recommendation**

Start with an external embedding provider and keep a modular interface that can be swapped later.

---

## 8. Query embedding generation

Vector search needs a query embedding. Do this in the MCP layer:

1. Take `query_text`
2. Apply query normalization (optional but recommended)
3. Compute the query embedding using the same model used for the chunks
4. Execute vector search SQL with a bound embedding vector

**Do not**

- accept arbitrary embedding vectors from untrusted callers without validation
- allow unbounded query lengths

---

## 9. Vector search semantics

### 9.1 Distance vs similarity

Depending on the embedding model and the vector search primitive, vector search may return:

- cosine distance (lower is better)
- cosine similarity (higher is better)
- L2 distance (lower is better)

**Recommendation**

Normalize to a “higher is better” score in MCP responses:

- if given a distance: `score_vec = 1 / (1 + distance)` or a similar monotonic transform

Keep the raw distance in debug fields if needed.
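As a sketch (not the final SQL), a KNN query of roughly this shape could back the `rag.search_vector` tool. The vector column name (`embedding` here) and the exact KNN syntax (`k = ?` vs. `ORDER BY distance LIMIT n`) depend on how the `vec0(...)` table is declared and on the sqlite3-vec version:

```sql
-- Top-k chunks nearest to a query embedding bound by the MCP layer,
-- with a "higher is better" score derived from the raw distance.
SELECT chunk_id,
       distance,                            -- raw distance (lower is better)
       1.0 / (1.0 + distance) AS score_vec  -- normalized score (higher is better)
FROM rag_vec_chunks
WHERE embedding MATCH :query_embedding
  AND k = 10
ORDER BY distance;
```

The `source_id` restriction described in 9.2 below can be added as an extra predicate if `source_id` is exposed as a filterable metadata column.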
### 9.2 Filtering

Filtering should be supported by:

- `source_id` restriction
- optional metadata filters (doc-level or chunk-level)

In v0, filtering by `source_id` is easiest because `rag_vec_chunks` stores `source_id` as metadata.

---

## 10. Hybrid retrieval integration

Embeddings are one leg of hybrid retrieval. Two recommended hybrid modes are described in `mcp-tools.md`:

1. **Fuse**: top-N FTS and top-N vector results, merged by chunk_id and fused by RRF (a SQL sketch appears in the appendix at the end of this document)
2. **FTS then vector**: broad FTS candidates, then a vector rerank within those candidates

Embeddings support both:

- Fuse mode needs a global vector search top-N.
- Candidate mode needs vector search restricted to the candidate chunk IDs.

Candidate mode is often cheaper and more precise when the query includes strong exact tokens.

---

## 11. Operational controls

### 11.1 Resource limits

Embedding generation must be bounded by:

- max chunk size embedded
- max chunks embedded per document
- per-source embedding rate limits
- timeouts when calling the embedding provider

### 11.2 Batch embedding

To improve throughput, embed in batches:

- collect N chunks
- send one embedding request for the N inputs
- store the results

### 11.3 Backpressure and async embedding

For v1, consider decoupling embedding generation from ingestion:

- ingestion stores chunks
- an embedding worker processes “pending” chunks and fills in their vectors

This allows:

- ingestion to remain fast
- embedding to scale independently
- retries on embedding failures

In this design, store a state record per chunk:

- pending / ok / error
- last error message
- retry count

---

## 12. Recommended implementation steps (coding agent checklist)

### v0 (synchronous embedding)

1. Implement `embedding_json` parsing in the ingester
2. Build the embedding input string for each chunk
3. Call the embedding provider (or use a stub in development)
4. Insert vector rows into `rag_vec_chunks`
5. Implement the `rag.search_vector` MCP tool using a query embedding + vector SQL

### v1 (efficient incremental embedding)

1. Add the `rag_chunk_embedding_state` table
2. Store `input_hash` per chunk per model
3. Only re-embed when the hash changed
4. Add an async embedding worker option
5. Add metrics for embedding throughput and failures

---

## 13. Summary

- Compute embeddings per chunk, not per document.
- Define the embedding input explicitly in `rag_sources.embedding_json`.
- Store vectors in `rag_vec_chunks` (vec0).
- For production, add hash-based update detection and optional async embedding workers.
- Normalize vector scores in MCP responses and keep the raw distance for debugging.
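---

## Appendix: RRF fuse-mode SQL sketch (illustrative)

A minimal sketch of the fuse mode from section 10 as a single SQLite statement, under stated assumptions: `rag_fts_chunks` is a hypothetical FTS5 table name (this document does not define the FTS schema), the `embedding` column and KNN syntax follow the same assumptions as the sketch in section 9, and `60` is the conventional RRF constant.

```sql
WITH vec_hits AS (
  -- Rank the top vector hits (r = 1 is the nearest chunk).
  SELECT chunk_id, ROW_NUMBER() OVER (ORDER BY distance) AS r
  FROM (
    SELECT chunk_id, distance
    FROM rag_vec_chunks
    WHERE embedding MATCH :query_embedding AND k = 50
    ORDER BY distance
  )
),
fts_hits AS (
  -- Rank the top FTS hits by BM25 (r = 1 is the best match).
  SELECT chunk_id, ROW_NUMBER() OVER (ORDER BY score) AS r
  FROM (
    SELECT chunk_id, rank AS score
    FROM rag_fts_chunks
    WHERE rag_fts_chunks MATCH :query_text
    ORDER BY rank
    LIMIT 50
  )
)
-- Reciprocal Rank Fusion: sum 1/(60 + rank) across both result lists.
SELECT chunk_id, SUM(1.0 / (60 + r)) AS rrf_score
FROM (
  SELECT chunk_id, r FROM vec_hits
  UNION ALL
  SELECT chunk_id, r FROM fts_hits
)
GROUP BY chunk_id
ORDER BY rrf_score DESC
LIMIT :top_n;
```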