ProxySQL RAG Index — Embeddings & Vector Retrieval Design (Chunk-Level) (v0→v1 Blueprint)

This document specifies how embeddings should be produced, stored, updated, and queried for chunk-level vector search in ProxySQL's RAG index. It is intended as an implementation blueprint.

It assumes:

  • Chunking is already implemented (rag_chunks).
  • ProxySQL includes sqlite3-vec and uses a vec0(...) virtual table (rag_vec_chunks).
  • Retrieval is exposed primarily via MCP tools (mcp-tools.md).

1. Design objectives

  1. Chunk-level embeddings

    • Each chunk receives its own embedding for retrieval precision.
  2. Deterministic embedding input

    • The text embedded is explicitly defined per source, not inferred.
  3. Model agility

    • The system can change embedding models/dimensions without breaking stored data or APIs.
  4. Efficient updates

    • Only recompute embeddings for chunks whose embedding input changed.
  5. Operational safety

    • Bound cost and latency (embedding generation can be expensive).
    • Allow asynchronous embedding jobs if needed later.

2. What to embed (and what not to embed)

2.1 Embed text that improves semantic retrieval

Recommended embedding input per chunk:

  • Document title (if present)
  • Tags (as plain text)
  • Chunk body

Example embedding input template:

{Title}
Tags: {Tags}

{ChunkBody}

This typically improves semantic recall significantly for knowledge-base-like content (StackOverflow posts, docs, tickets, runbooks).

2.2 Do NOT embed numeric metadata by default

Do not embed fields like Score, ViewCount, OwnerUserId, timestamps, etc. These should remain structured and be used for:

  • filtering
  • boosting
  • tie-breaking
  • result shaping

Embedding numeric metadata into text typically adds noise and reduces semantic quality.

2.3 Code and HTML considerations

If your chunk body contains HTML or code:

  • v0: embed raw text (works, but may be noisy)
  • v1: normalize to improve quality:
    • strip HTML tags (keep text content)
    • preserve code blocks as text, but consider stripping excessive markup
    • optionally create specialized “code-only” chunks for code-heavy sources

Normalization should be source-configurable.
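A minimal sketch of such normalization is shown below. It assumes the normalize options from embedding_json (section 3); the regex-based HTML stripping and the choice to preserve paragraph breaks are illustrative design choices, not the required implementation.

import re

def normalize_text(text, rules):
    # Sketch of the normalize options carried in embedding_json (section 3).
    if rules.get("strip_html"):
        text = re.sub(r"<[^>]+>", " ", text)    # drop tags, keep text content
    if rules.get("collapse_whitespace"):
        text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces/tabs
        text = re.sub(r"\n{3,}", "\n\n", text)  # cap consecutive blank lines
        text = text.strip()
    return text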


3. Where embedding input rules are defined

Embedding input rules must be explicit and stored per source.

3.1 rag_sources.embedding_json

Recommended schema:

{
  "enabled": true,
  "model": "text-embedding-3-large",
  "dim": 1536,
  "input": {
    "concat": [
      {"col":"Title"},
      {"lit":"\nTags: "}, {"col":"Tags"},
      {"lit":"\n\n"},
      {"chunk_body": true}
    ]
  },
  "normalize": {
    "strip_html": true,
    "collapse_whitespace": true
  }
}

Semantics

  • enabled: whether to compute/store embeddings for this source
  • model: logical name (for observability and compatibility checks)
  • dim: vector dimension
  • input.concat: how to build embedding input text
  • normalize: optional normalization steps
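A sketch of how an ingester could interpret input.concat follows. It assumes the source row is available as a dict keyed by column name; the function name is hypothetical.

def build_embedding_input(spec, row, chunk_body):
    # Interprets input.concat from embedding_json; `row` holds the source
    # document's columns (Title, Tags, ...), `chunk_body` the chunk text.
    parts = []
    for item in spec["input"]["concat"]:
        if "col" in item:
            parts.append(str(row.get(item["col"]) or ""))
        elif "lit" in item:
            parts.append(item["lit"])
        elif item.get("chunk_body"):
            parts.append(chunk_body)
    return "".join(parts)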

4. Storage schema and model/versioning

4.1 Current v0 schema: single vector table

rag_vec_chunks stores:

  • embedding vector
  • chunk_id
  • doc_id/source_id convenience columns
  • updated_at

This is appropriate for v0 when you assume a single embedding model/dimension.

In a product setting, you may want multiple embedding models (e.g. general vs code-centric).

Two ways to support this:

Option A: include model identity columns in rag_vec_chunks

Add columns:

  • model TEXT
  • dim INTEGER (optional if fixed per model)

Then allow multiple rows per chunk_id (the unique key becomes (chunk_id, model)). This may require a schema change and a different vec0 design (some vec0 configurations support metadata columns, but uniqueness must be handled carefully).

Option B: one vector table per model/dimension

Create:

  • rag_vec_chunks_1536_v1
  • rag_vec_chunks_1024_code_v1, etc.

Then MCP tools select the table based on requested model or default configuration.

Recommendation: Start with Option A only if your sqlite3-vec build makes it easy to filter by model. Otherwise, Option B is operationally cleaner.
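A sketch of Option B table naming and creation, assuming a vec0 declaration that mirrors the columns the v0 rag_vec_chunks table already keeps; the exact vec0 column options supported depend on the bundled sqlite3-vec build, so treat the DDL as illustrative.

def vec_table_name(dim, variant="", version=1):
    # e.g. (1536) -> "rag_vec_chunks_1536_v1"; (1024, "code") -> "rag_vec_chunks_1024_code_v1"
    middle = f"_{variant}" if variant else ""
    return f"rag_vec_chunks_{dim}{middle}_v{version}"

def create_vec_table(conn, dim, variant=""):
    table = vec_table_name(dim, variant)
    conn.execute(
        f"CREATE VIRTUAL TABLE IF NOT EXISTS {table} USING vec0("
        f"chunk_id TEXT PRIMARY KEY, "
        f"embedding FLOAT[{dim}], "
        f"source_id TEXT, "
        f"doc_id TEXT)"
    )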


5. Embedding generation pipeline

5.1 When embeddings are created

Embeddings are created during ingestion, immediately after chunk creation, if embedding_json.enabled=true.

This provides a simple, synchronous pipeline:

  • ingest row → create chunks → compute embedding → store vector
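The synchronous pipeline can be sketched as below. create_chunks, store_chunk, and store_vector are hypothetical helpers around the existing chunking code and the rag_chunks / rag_vec_chunks tables; build_embedding_input and normalize_text are the sketches from sections 2.3 and 3, and embed_texts is a provider call like the one sketched in section 7.

def ingest_row(conn, source, row):
    spec = source.embedding_spec            # parsed rag_sources.embedding_json
    for chunk in create_chunks(source, row):
        store_chunk(conn, chunk)
        if spec and spec.get("enabled"):
            text = build_embedding_input(spec, row, chunk.body)
            text = normalize_text(text, spec.get("normalize", {}))
            vector = embed_texts([text], model=spec["model"])[0]  # provider call (section 7)
            store_vector(conn, chunk.chunk_id, vector)
    conn.commit()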

5.2 When embeddings should be updated

Embeddings must be recomputed if the embedding input string changes. That depends on:

  • the document title changes
  • the tags change
  • the chunk body changes
  • the normalization rules change (strip_html, etc.)
  • the embedding model changes

Therefore, update logic should be based on a content hash of the embedding input.


6. Content hashing for efficient updates (v1 recommendation)

6.1 Why hashing is needed

Without hashing, you might recompute embeddings unnecessarily:

  • expensive
  • slow
  • prevents incremental sync from being efficient

To avoid this, store an embedding_input_hash per chunk per model.

Implementation options:

Option A: Store hash in rag_chunks.metadata_json

Example:

{
  "chunk_index": 0,
  "embedding_hash": "sha256:...",
  "embedding_model": "text-embedding-3-large"
}

Pros: no schema changes.
Cons: JSON parsing overhead.

Option B: Store hash in a dedicated state table

Create rag_chunk_embedding_state:

CREATE TABLE rag_chunk_embedding_state (
  chunk_id   TEXT NOT NULL,
  model      TEXT NOT NULL,
  dim        INTEGER NOT NULL,
  input_hash TEXT NOT NULL,
  updated_at INTEGER NOT NULL DEFAULT (unixepoch()),
  PRIMARY KEY(chunk_id, model)
);

Pros: fast lookups; avoids JSON parsing.
Cons: extra table.

Recommendation: Use Option B for v1.
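A minimal sketch of the hash computation and the re-embedding decision against rag_chunk_embedding_state; including the model name in the hashed input is an assumption that conveniently forces re-embedding on a model switch.

import hashlib

def embedding_input_hash(model, embedding_input):
    # Include the model name so switching models invalidates stored hashes.
    digest = hashlib.sha256(f"{model}\x00{embedding_input}".encode("utf-8")).hexdigest()
    return f"sha256:{digest}"

def needs_reembedding(conn, chunk_id, model, new_hash):
    row = conn.execute(
        "SELECT input_hash FROM rag_chunk_embedding_state "
        "WHERE chunk_id = ? AND model = ?",
        (chunk_id, model),
    ).fetchone()
    return row is None or row[0] != new_hash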


7. Embedding model integration options

7.1 External embedding service

ProxySQL calls an embedding service:

  • OpenAI-compatible endpoint, or
  • local service (e.g. llama.cpp server), or
  • vendor-specific embedding API

Pros:

  • easy to iterate on model choice
  • isolates ML runtime from ProxySQL process

Cons:

  • network latency; requires caching and timeouts

7.2 Embedded model runtime inside ProxySQL

ProxySQL links an embedding runtime (llama.cpp, etc.) into its own process.

Pros:

  • no network dependency
  • predictable latency if tuned

Cons:

  • increases memory footprint
  • needs careful resource controls

Recommendation: Start with an external embedding provider and keep a modular interface that can be swapped later.
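A sketch of such a provider call, assuming the Python requests library and an OpenAI-compatible /v1/embeddings endpoint (the URL is hypothetical); the timeout keeps embedding failures bounded, and batching multiple inputs per call matches section 11.2.

import requests

EMBEDDINGS_URL = "http://embedder.internal:8080/v1/embeddings"  # hypothetical endpoint

def embed_texts(texts, model, timeout=10.0):
    # Call an OpenAI-compatible embeddings endpoint for a batch of inputs.
    resp = requests.post(
        EMBEDDINGS_URL,
        json={"model": model, "input": texts},
        timeout=timeout,
    )
    resp.raise_for_status()
    data = sorted(resp.json()["data"], key=lambda d: d["index"])
    return [d["embedding"] for d in data]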


8. Query embedding generation

Vector search needs a query embedding. Do this in the MCP layer:

  1. Take query_text
  2. Apply query normalization (optional but recommended)
  3. Compute query embedding using the same model used for chunks
  4. Execute vector search SQL with a bound embedding vector

Do not:

  • accept arbitrary embedding vectors from untrusted callers without validation
  • allow unbounded query lengths
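Putting these steps together, a sketch of query-side embedding in the MCP layer could look as follows; it reuses the normalize_text and embed_texts sketches above, and the MAX_QUERY_CHARS bound is an assumed value to tune per deployment.

MAX_QUERY_CHARS = 2000   # assumed bound; tune per deployment

def embed_query(query_text, spec):
    if not query_text or len(query_text) > MAX_QUERY_CHARS:
        raise ValueError("query_text missing or too long")
    text = normalize_text(query_text, spec.get("normalize", {}))  # same rules as chunk inputs
    vector = embed_texts([text], model=spec["model"])[0]          # same model as chunk embeddings
    if len(vector) != spec["dim"]:
        raise ValueError("embedding dimension mismatch")
    return vector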

9. Vector search semantics

9.1 Distance vs similarity

Depending on the embedding model and vec search primitive, vector search may return:

  • cosine distance (lower is better)
  • cosine similarity (higher is better)
  • L2 distance (lower is better)

Recommendation: Normalize to a “higher is better” score in MCP responses:

  • if distance: score_vec = 1 / (1 + distance) or similar monotonic transform

Keep raw distance in debug fields if needed.
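The transform named above is trivial to implement; a sketch:

def vec_score(distance):
    # Monotonic "higher is better" transform of a distance; keep the raw
    # distance alongside it in debug fields.
    return 1.0 / (1.0 + distance)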

9.2 Filtering

Filtering should be supported by:

  • source_id restriction
  • optional metadata filters (doc-level or chunk-level)

In v0, filtering by source_id is easiest because rag_vec_chunks stores source_id as a metadata column.
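A sketch of a filtered KNN query follows. The exact MATCH / vector-binding / LIMIT syntax depends on the sqlite3-vec build bundled with ProxySQL, and the JSON binding of the query vector is one of the accepted input forms in sqlite-vec-style builds; treat the SQL as an assumption to verify against the actual build. vec_score is the transform from section 9.1.

import json

SEARCH_SQL = """
    SELECT chunk_id, distance
      FROM rag_vec_chunks
     WHERE embedding MATCH :qvec
       AND source_id = :source_id
     ORDER BY distance
     LIMIT :top_n
"""

def search_vectors(conn, query_vec, source_id, top_n=20):
    rows = conn.execute(SEARCH_SQL, {
        "qvec": json.dumps(query_vec),   # or a float32 blob, depending on the build
        "source_id": source_id,
        "top_n": top_n,
    }).fetchall()
    return [(chunk_id, vec_score(dist)) for chunk_id, dist in rows]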


10. Hybrid retrieval integration

Embeddings are one leg of hybrid retrieval. Two recommended hybrid modes are described in mcp-tools.md:

  1. Fuse: top-N FTS and top-N vector results, merged by chunk_id and fused with Reciprocal Rank Fusion (RRF)
  2. FTS then vector: broad FTS candidates then vector rerank within candidates

Embeddings support both:

  • Fuse mode needs global vector search top-N.
  • Candidate mode needs vector search restricted to candidate chunk IDs.

Candidate mode is often cheaper and more precise when the query includes strong exact tokens.
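For reference, standard RRF over two ranked lists of chunk IDs can be sketched as below; k=60 is the commonly used constant, not a value mandated by this design.

def rrf_fuse(fts_ranked, vec_ranked, k=60):
    # Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per chunk_id.
    scores = {}
    for ranked in (fts_ranked, vec_ranked):
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)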


11. Operational controls

11.1 Resource limits

Embedding generation must be bounded by:

  • max chunk size embedded
  • max chunks embedded per document
  • per-source embedding rate limit
  • timeouts when calling embedding provider

11.2 Batch embedding

To improve throughput, embed in batches:

  • collect N chunks
  • send embedding request for N inputs
  • store results
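A minimal batching sketch, reusing the embed_texts provider call from section 7; the batch size of 64 and the store_vector helper are hypothetical.

def batched(items, size):
    for i in range(0, len(items), size):
        yield items[i:i + size]

# pending: list of (chunk_id, embedding_input) pairs awaiting vectors
for batch in batched(pending, 64):
    vectors = embed_texts([text for _, text in batch], model=spec["model"])
    for (chunk_id, _), vector in zip(batch, vectors):
        store_vector(conn, chunk_id, vector)   # hypothetical helper (section 5)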

11.3 Backpressure and async embedding

For v1, consider decoupling embedding generation from ingestion:

  • ingestion stores chunks
  • embedding worker processes “pending” chunks and fills vectors

This allows:

  • ingestion to remain fast
  • embedding to scale independently
  • retries on embedding failures

In this design, store a state record:

  • pending / ok / error
  • last error message
  • retry count
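A sketch of such a worker follows. It assumes rag_chunk_embedding_state is extended with status, last_error, and retry_count columns (not part of the v1 table defined in section 6), and load_embedding_input and store_vector are hypothetical helpers.

def run_embedding_worker(conn, spec, batch_limit=100, max_retries=3):
    rows = conn.execute(
        "SELECT chunk_id FROM rag_chunk_embedding_state "
        "WHERE model = ? AND status = 'pending' AND retry_count < ? LIMIT ?",
        (spec["model"], max_retries, batch_limit),
    ).fetchall()
    for (chunk_id,) in rows:
        try:
            text = load_embedding_input(conn, chunk_id, spec)
            vector = embed_texts([text], model=spec["model"])[0]
            store_vector(conn, chunk_id, vector)
            conn.execute(
                "UPDATE rag_chunk_embedding_state SET status='ok', last_error=NULL "
                "WHERE chunk_id=? AND model=?", (chunk_id, spec["model"]))
        except Exception as exc:
            conn.execute(
                "UPDATE rag_chunk_embedding_state "
                "SET status='error', last_error=?, retry_count=retry_count+1 "
                "WHERE chunk_id=? AND model=?", (str(exc), chunk_id, spec["model"]))
    conn.commit()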

12. Implementation plan (v0 → v1)

v0 (synchronous embedding)

  1. Implement embedding_json parsing in ingester
  2. Build embedding input string for each chunk
  3. Call embedding provider (or use a stub in development)
  4. Insert vector rows into rag_vec_chunks
  5. Implement rag.search_vector MCP tool using query embedding + vector SQL

v1 (efficient incremental embedding)

  1. Add rag_chunk_embedding_state table
  2. Store input_hash per chunk per model
  3. Only re-embed if hash changed
  4. Add async embedding worker option
  5. Add metrics for embedding throughput and failures

13. Summary

  • Compute embeddings per chunk, not per document.
  • Define embedding input explicitly in rag_sources.embedding_json.
  • Store vectors in rag_vec_chunks (vec0).
  • For production, add hash-based update detection and optional async embedding workers.
  • Normalize vector scores in MCP responses and keep raw distance for debugging.