ProxySQL RAG Index — Embeddings & Vector Retrieval Design (Chunk-Level) (v0→v1 Blueprint)
This document specifies how embeddings should be produced, stored, updated, and queried for chunk-level vector search in ProxySQL’s RAG index. It is intended as an implementation blueprint.
It assumes:
- Chunking is already implemented (rag_chunks).
- ProxySQL includes sqlite3-vec and uses a vec0(...) virtual table (rag_vec_chunks).
- Retrieval is exposed primarily via MCP tools (mcp-tools.md).
1. Design objectives
- Chunk-level embeddings
  - Each chunk receives its own embedding for retrieval precision.
- Deterministic embedding input
  - The text embedded is explicitly defined per source, not inferred.
- Model agility
  - The system can change embedding models/dimensions without breaking stored data or APIs.
- Efficient updates
  - Only recompute embeddings for chunks whose embedding input changed.
- Operational safety
  - Bound cost and latency (embedding generation can be expensive).
  - Allow asynchronous embedding jobs if needed later.
2. What to embed (and what not to embed)
2.1 Embed text that improves semantic retrieval
Recommended embedding input per chunk:
- Document title (if present)
- Tags (as plain text)
- Chunk body
Example embedding input template:
{Title}
Tags: {Tags}
{ChunkBody}
This typically improves semantic recall significantly for knowledge-base-like content (StackOverflow posts, docs, tickets, runbooks).
2.2 Do NOT embed numeric metadata by default
Do not embed fields like Score, ViewCount, OwnerUserId, timestamps, etc. These should remain structured and be used for:
- filtering
- boosting
- tie-breaking
- result shaping
Embedding numeric metadata into text typically adds noise and reduces semantic quality.
2.3 Code and HTML considerations
If your chunk body contains HTML or code:
- v0: embed raw text (works, but may be noisy)
- v1: normalize to improve quality:
  - strip HTML tags (keep text content)
  - preserve code blocks as text, but consider stripping excessive markup
  - optionally create specialized “code-only” chunks for code-heavy sources
Normalization should be source-configurable.
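A minimal normalization sketch under the strip_html / collapse_whitespace options above (Python; the helper name is illustrative, not an existing ProxySQL API):

import html
import re

def normalize_for_embedding(text: str, strip_html: bool = True,
                            collapse_whitespace: bool = True) -> str:
    """Apply source-configurable normalization before embedding (illustrative)."""
    if strip_html:
        # Drop tags but keep their text content; decode entities like &amp;
        text = re.sub(r"<[^>]+>", " ", text)
        text = html.unescape(text)
    if collapse_whitespace:
        # Collapse runs of spaces/newlines so formatting noise does not dominate
        text = re.sub(r"\s+", " ", text).strip()
    return text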
3. Where embedding input rules are defined
Embedding input rules must be explicit and stored per source.
3.1 rag_sources.embedding_json
Recommended schema:
{
  "enabled": true,
  "model": "text-embedding-3-large",
  "dim": 1536,
  "input": {
    "concat": [
      {"col": "Title"},
      {"lit": "\nTags: "},
      {"col": "Tags"},
      {"lit": "\n\n"},
      {"chunk_body": true}
    ]
  },
  "normalize": {
    "strip_html": true,
    "collapse_whitespace": true
  }
}
Semantics
- enabled: whether to compute/store embeddings for this source
- model: logical name (for observability and compatibility checks)
- dim: vector dimension
- input.concat: how to build embedding input text
- normalize: optional normalization steps
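As an illustration, input.concat could be interpreted along these lines when building the embedding input for one chunk (a Python sketch; row, chunk_body, and the function name are illustrative, not an existing ProxySQL API):

def build_embedding_input(concat_spec: list, row: dict, chunk_body: str) -> str:
    """Concatenate literals, row columns, and the chunk body per embedding_json."""
    parts = []
    for item in concat_spec:
        if "lit" in item:                    # literal text, e.g. "\nTags: "
            parts.append(item["lit"])
        elif "col" in item:                  # document-level column, e.g. Title
            parts.append(str(row.get(item["col"], "") or ""))
        elif item.get("chunk_body"):         # the chunk text itself
            parts.append(chunk_body)
    return "".join(parts)

# Example with the schema above:
# build_embedding_input(cfg["input"]["concat"],
#                       {"Title": "Example question title", "Tags": "proxysql,mysql"},
#                       "Chunk body text ...")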
4. Storage schema and model/versioning
4.1 Current v0 schema: single vector table
rag_vec_chunks stores:
- embedding vector
- chunk_id
- doc_id/source_id convenience columns
- updated_at
This is appropriate for v0 when you assume a single embedding model/dimension.
4.2 Recommended v1 evolution: support multiple models
In a product setting, you may want multiple embedding models (e.g. general vs code-centric).
Two ways to support this:
Option A: include model identity columns in rag_vec_chunks
Add columns:
- model TEXT
- dim INTEGER (optional if fixed per model)
Then allow multiple rows per chunk_id (unique key becomes (chunk_id, model)).
This may require schema change and a different vec0 design (some vec0 configurations support metadata columns, but uniqueness must be handled carefully).
Option B: one vec table per model (recommended if vec0 constraints exist)
Create:
- rag_vec_chunks_1536_v1
- rag_vec_chunks_1024_code_v1
- etc.
Then MCP tools select the table based on requested model or default configuration.
Recommendation: Start with Option A only if your sqlite3-vec build makes it easy to filter by model. Otherwise, Option B is operationally cleaner.
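If Option B is chosen, the per-model tables and model-to-table mapping might look like the following sketch (the exact vec0 metadata-column syntax depends on the sqlite3-vec version, so treat the DDL as illustrative):

# Sketch: one vec0 table per embedding model (names illustrative).
CREATE_VEC_TABLE_1536 = """
CREATE VIRTUAL TABLE IF NOT EXISTS rag_vec_chunks_1536_v1 USING vec0(
  embedding float[1536],
  chunk_id TEXT,
  doc_id TEXT,
  source_id TEXT
);
"""

# MCP tools pick the table from the requested model or the default configuration.
VEC_TABLE_BY_MODEL = {
    "text-embedding-3-large": "rag_vec_chunks_1536_v1",
    "code-embed-v1": "rag_vec_chunks_1024_code_v1",   # hypothetical code-centric model
}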
5. Embedding generation pipeline
5.1 When embeddings are created
Embeddings are created during ingestion, immediately after chunk creation, if embedding_json.enabled=true.
This provides a simple, synchronous pipeline:
- ingest row → create chunks → compute embedding → store vector
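A minimal sketch of that flow, composing the helpers sketched in sections 2.3 and 3.1 with a provider call like the one in section 7.1 (Python; the rag_vec_chunks column names and the JSON vector encoding are assumptions):

import json

def ingest_row(db, embed_cfg, row, chunker, embed_texts):
    """v0 synchronous path: chunk, build input, embed, store vectors (illustrative)."""
    chunks = chunker(row)                      # chunking is assumed already implemented
    if not embed_cfg.get("enabled"):
        return
    inputs = [
        normalize_for_embedding(
            build_embedding_input(embed_cfg["input"]["concat"], row, c["body"]),
            **embed_cfg.get("normalize", {}),
        )
        for c in chunks
    ]
    vectors = embed_texts(inputs)              # one provider call for all chunks of the row
    for chunk, vec in zip(chunks, vectors):
        db.execute(
            "INSERT INTO rag_vec_chunks(chunk_id, doc_id, source_id, embedding) "
            "VALUES (?, ?, ?, ?)",
            (chunk["chunk_id"], row["doc_id"], row["source_id"], json.dumps(vec)),
        )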
5.2 When embeddings should be updated
Embeddings must be recomputed if the embedding input string changes. That depends on:
- title changes
- tags changes
- chunk body changes
- normalization rules changes (strip_html etc.)
- embedding model changes
Therefore, update logic should be based on a content hash of the embedding input.
6. Content hashing for efficient updates (v1 recommendation)
6.1 Why hashing is needed
Without hashing, you might recompute embeddings unnecessarily:
- expensive
- slow
- prevents incremental sync from being efficient
6.2 Recommended approach
Store embedding_input_hash per chunk per model.
Implementation options:
Option A: Store hash in rag_chunks.metadata_json
Example:
{
  "chunk_index": 0,
  "embedding_hash": "sha256:...",
  "embedding_model": "text-embedding-3-large"
}
Pros: no schema changes.
Cons: JSON parsing overhead.
Option B: Dedicated side table (recommended)
Create rag_chunk_embedding_state:
CREATE TABLE rag_chunk_embedding_state (
chunk_id TEXT NOT NULL,
model TEXT NOT NULL,
dim INTEGER NOT NULL,
input_hash TEXT NOT NULL,
updated_at INTEGER NOT NULL DEFAULT (unixepoch()),
PRIMARY KEY(chunk_id, model)
);
Pros: fast lookups; avoids JSON parsing.
Cons: extra table.
Recommendation: Use Option B for v1.
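A sketch of the hash check and upsert against this table (Python with sqlite3 for illustration; ProxySQL would issue equivalent SQL internally):

import hashlib
import sqlite3

def needs_reembedding(db: sqlite3.Connection, chunk_id: str, model: str,
                      embedding_input: str) -> tuple[bool, str]:
    """Return (should_embed, input_hash) by comparing against the stored hash."""
    input_hash = "sha256:" + hashlib.sha256(embedding_input.encode("utf-8")).hexdigest()
    row = db.execute(
        "SELECT input_hash FROM rag_chunk_embedding_state WHERE chunk_id=? AND model=?",
        (chunk_id, model),
    ).fetchone()
    return (row is None or row[0] != input_hash), input_hash

def record_embedding_state(db, chunk_id: str, model: str, dim: int, input_hash: str):
    """Upsert the state row after the vector has been (re)computed and stored."""
    db.execute(
        "INSERT INTO rag_chunk_embedding_state(chunk_id, model, dim, input_hash) "
        "VALUES (?, ?, ?, ?) "
        "ON CONFLICT(chunk_id, model) DO UPDATE SET "
        "input_hash=excluded.input_hash, dim=excluded.dim, updated_at=unixepoch()",
        (chunk_id, model, dim, input_hash),
    )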
7. Embedding model integration options
7.1 External embedding service (recommended initially)
ProxySQL calls an embedding service:
- OpenAI-compatible endpoint, or
- local service (e.g. llama.cpp server), or
- vendor-specific embedding API
Pros:
- easy to iterate on model choice
- isolates ML runtime from ProxySQL process
Cons:
- network latency; requires caching and timeouts
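A sketch of calling an OpenAI-compatible embeddings endpoint with a timeout (Python, standard library only; the endpoint URL and default model name are placeholders):

import json
import urllib.request

def embed_texts(texts: list[str], model: str = "text-embedding-3-large",
                endpoint: str = "http://localhost:8080/v1/embeddings",
                timeout_s: float = 10.0) -> list[list[float]]:
    """Call an OpenAI-compatible /v1/embeddings endpoint; raises on timeout/HTTP error."""
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    req = urllib.request.Request(
        endpoint, data=payload,
        headers={"Content-Type": "application/json"},  # add Authorization if required
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        body = json.load(resp)
    # Response items carry an "index", so order can be restored explicitly.
    items = sorted(body["data"], key=lambda d: d["index"])
    return [item["embedding"] for item in items]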
7.2 Embedded model runtime inside ProxySQL
ProxySQL links to an embedding runtime (llama.cpp, etc.).
Pros:
- no network dependency
- predictable latency if tuned
Cons:
- increases memory footprint
- needs careful resource controls
Recommendation: Start with an external embedding provider and keep a modular interface that can be swapped later.
8. Query embedding generation
Vector search needs a query embedding. Do this in the MCP layer:
- Take query_text
- Apply query normalization (optional but recommended)
- Compute query embedding using the same model used for chunks
- Execute vector search SQL with a bound embedding vector
Do not:
- accept arbitrary embedding vectors from untrusted callers without validation
- allow unbounded query lengths
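A minimal MCP-layer sketch (Python, reusing the illustrative normalize/embed helpers from earlier sections; the KNN constraint syntax, column names, and the length bound depend on the sqlite3-vec build and deployment, and are assumptions here):

import json

MAX_QUERY_CHARS = 2048  # bound query length before embedding (tunable assumption)

def search_vector(db, query_text: str, model: str, top_n: int = 10):
    """Embed the query with the same model used for chunks, then run a KNN query."""
    if not query_text or len(query_text) > MAX_QUERY_CHARS:
        raise ValueError("query_text missing or too long")
    query_text = normalize_for_embedding(query_text)
    query_vec = embed_texts([query_text], model=model)[0]
    rows = db.execute(
        "SELECT chunk_id, distance FROM rag_vec_chunks "
        "WHERE embedding MATCH ? AND k = ? ORDER BY distance",
        (json.dumps(query_vec), top_n),
    ).fetchall()
    # Convert distance to a "higher is better" score (see section 9.1).
    return [{"chunk_id": cid, "score_vec": 1.0 / (1.0 + dist)} for cid, dist in rows]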
9. Vector search semantics
9.1 Distance vs similarity
Depending on the embedding model and vec search primitive, vector search may return:
- cosine distance (lower is better)
- cosine similarity (higher is better)
- L2 distance (lower is better)
Recommendation: Normalize to a “higher is better” score in MCP responses:
- if distance: score_vec = 1 / (1 + distance), or a similar monotonic transform
Keep raw distance in debug fields if needed.
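For example, a small mapping that keeps MCP scores uniform regardless of the underlying metric (a sketch; metric names are illustrative labels):

def to_score(value: float, metric: str) -> float:
    """Map raw vector-search output onto a 'higher is better' score for MCP responses."""
    if metric in ("cosine_distance", "l2_distance"):
        return 1.0 / (1.0 + value)     # lower distance -> higher score
    if metric == "cosine_similarity":
        return value                   # already higher-is-better
    raise ValueError(f"unknown metric: {metric}")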
9.2 Filtering
Filtering should be supported by:
- source_id restriction
- optional metadata filters (doc-level or chunk-level)
In v0, filtering by source_id is easiest because rag_vec_chunks stores source_id as metadata.
10. Hybrid retrieval integration
Embeddings are one leg of hybrid retrieval. Two recommended hybrid modes are described in mcp-tools.md:
- Fuse: top-N FTS and top-N vector, merged by chunk_id, fused by RRF
- FTS then vector: broad FTS candidates then vector rerank within candidates
Embeddings support both:
- Fuse mode needs global vector search top-N.
- Candidate mode needs vector search restricted to candidate chunk IDs.
Candidate mode is often cheaper and more precise when the query includes strong exact tokens.
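A minimal RRF sketch for Fuse mode (Python; k = 60 is the conventional RRF constant, assumed here rather than specified by this document):

def rrf_fuse(fts_ids: list[str], vec_ids: list[str], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(id) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranked in (fts_ids, vec_ids):
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Candidate mode instead restricts the vector search to the FTS candidate chunk_ids
# (e.g. an added "chunk_id IN (...)" filter) and reranks only within that set.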
11. Operational controls
11.1 Resource limits
Embedding generation must be bounded by:
- max chunk size embedded
- max chunks embedded per document
- per-source embedding rate limit
- timeouts when calling embedding provider
11.2 Batch embedding
To improve throughput, embed in batches:
- collect N chunks
- send embedding request for N inputs
- store results
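A minimal batching sketch (Python, reusing the illustrative embed_texts from section 7.1; the batch size and the embedding_input field are assumptions):

def embed_chunks_in_batches(chunks: list[dict], batch_size: int = 64):
    """Embed N chunks per provider call instead of one call per chunk."""
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        vectors = embed_texts([c["embedding_input"] for c in batch])
        for chunk, vec in zip(batch, vectors):
            yield chunk["chunk_id"], vec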
11.3 Backpressure and async embedding
For v1, consider decoupling embedding generation from ingestion:
- ingestion stores chunks
- embedding worker processes “pending” chunks and fills vectors
This allows:
- ingestion to remain fast
- embedding to scale independently
- retries on embedding failures
In this design, store a state record:
- pending / ok / error
- last error message
- retry count
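A sketch of one worker pass over pending chunks (Python; the status/last_error/retry_count columns and the build_input_for_chunk/store_vector helpers are assumptions layered on top of the v1 state table):

MAX_RETRIES = 5

def run_embedding_worker_once(db, model, batch_size=64):
    """Embed a batch of 'pending' chunks and record ok/error state (illustrative)."""
    rows = db.execute(
        "SELECT chunk_id FROM rag_chunk_embedding_state "
        "WHERE model=? AND status='pending' AND retry_count < ? LIMIT ?",
        (model, MAX_RETRIES, batch_size),
    ).fetchall()
    pending = [r[0] for r in rows]
    if not pending:
        return 0
    try:
        # build_input_for_chunk(): hypothetical helper rebuilding the embedding input
        vectors = embed_texts([build_input_for_chunk(db, cid) for cid in pending])
    except Exception as exc:
        db.executemany(
            "UPDATE rag_chunk_embedding_state SET status='error', last_error=?, "
            "retry_count=retry_count+1 WHERE chunk_id=? AND model=?",
            [(str(exc), cid, model) for cid in pending],
        )
        return 0
    for cid, vec in zip(pending, vectors):
        store_vector(db, cid, model, vec)   # hypothetical insert/replace in the vec table
        db.execute(
            "UPDATE rag_chunk_embedding_state SET status='ok', last_error=NULL "
            "WHERE chunk_id=? AND model=?",
            (cid, model),
        )
    return len(pending)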
12. Recommended implementation steps (coding agent checklist)
v0 (synchronous embedding)
- Implement embedding_json parsing in ingester
- Build embedding input string for each chunk
- Call embedding provider (or use a stub in development)
- Insert vector rows into rag_vec_chunks
- Implement rag.search_vector MCP tool using query embedding + vector SQL
v1 (efficient incremental embedding)
- Add rag_chunk_embedding_state table
- Store input_hash per chunk per model
- Only re-embed if hash changed
- Add async embedding worker option
- Add metrics for embedding throughput and failures
13. Summary
- Compute embeddings per chunk, not per document.
- Define embedding input explicitly in rag_sources.embedding_json.
- Store vectors in rag_vec_chunks (vec0).
- For production, add hash-based update detection and optional async embedding workers.
- Normalize vector scores in MCP responses and keep raw distance for debugging.