You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
proxysql/RAG_POC/sql-examples.md

8.1 KiB

ProxySQL RAG Index — SQL Examples (FTS, Vectors, Hybrid)

This file provides concrete SQL examples for querying the ProxySQL-hosted SQLite RAG index directly (for debugging, internal dashboards, or SQL-native applications).

The preferred interface for AI agents remains MCP tools (mcp-tools.md). SQL access should typically be restricted to trusted callers.

Assumed tables:

  • rag_documents
  • rag_chunks
  • rag_fts_chunks (FTS5)
  • rag_vec_chunks (sqlite3-vec vec0 table)

0. Common joins and inspection

0.1 Inspect one document and its chunks

SELECT * FROM rag_documents WHERE doc_id = 'posts:12345';
SELECT * FROM rag_chunks WHERE doc_id = 'posts:12345' ORDER BY chunk_index;

0.2 Use the convenience view (if enabled)

SELECT * FROM rag_chunk_view WHERE doc_id = 'posts:12345' ORDER BY chunk_id;

1. FTS5 examples

1.1 Basic FTS search (top 10)

SELECT
  f.chunk_id,
  bm25(rag_fts_chunks) AS score_fts_raw
FROM rag_fts_chunks f
WHERE rag_fts_chunks MATCH 'json_extract mysql'
ORDER BY score_fts_raw
LIMIT 10;

1.2 Join FTS results to chunk text and document metadata

SELECT
  f.chunk_id,
  bm25(rag_fts_chunks) AS score_fts_raw,
  c.doc_id,
  COALESCE(c.title, d.title) AS title,
  c.body AS chunk_body,
  d.metadata_json AS doc_metadata_json
FROM rag_fts_chunks f
JOIN rag_chunks c ON c.chunk_id = f.chunk_id
JOIN rag_documents d ON d.doc_id = c.doc_id
WHERE rag_fts_chunks MATCH 'json_extract mysql'
  AND c.deleted = 0 AND d.deleted = 0
ORDER BY score_fts_raw
LIMIT 10;

1.3 Apply a source filter (by source_id)

SELECT
  f.chunk_id,
  bm25(rag_fts_chunks) AS score_fts_raw
FROM rag_fts_chunks f
JOIN rag_chunks c ON c.chunk_id = f.chunk_id
WHERE rag_fts_chunks MATCH 'replication lag'
  AND c.source_id = 1
ORDER BY score_fts_raw
LIMIT 20;

1.4 Phrase queries, boolean operators (FTS5)

-- phrase
SELECT chunk_id FROM rag_fts_chunks
WHERE rag_fts_chunks MATCH '"group replication"'
LIMIT 20;

-- boolean: term1 AND term2
SELECT chunk_id FROM rag_fts_chunks
WHERE rag_fts_chunks MATCH 'mysql AND deadlock'
LIMIT 20;

-- boolean: term1 NOT term2
SELECT chunk_id FROM rag_fts_chunks
WHERE rag_fts_chunks MATCH 'mysql NOT mariadb'
LIMIT 20;

2. Vector search examples (sqlite3-vec)

Vector SQL varies slightly depending on sqlite3-vec build and how you bind vectors. Below are two patterns you can implement in ProxySQL.

In this pattern, ProxySQL:

  1. Computes the query embedding in C++
  2. Executes SQL with a bound parameter :qvec representing the embedding

A typical “nearest neighbors” query shape is:

-- PSEUDOCODE: adapt to sqlite3-vec's exact operator/function in your build.
SELECT
  v.chunk_id,
  v.distance AS distance_raw
FROM rag_vec_chunks v
WHERE v.embedding MATCH :qvec
ORDER BY distance_raw
LIMIT 10;

Then join to chunks:

-- PSEUDOCODE: join with content and metadata
SELECT
  v.chunk_id,
  v.distance AS distance_raw,
  c.doc_id,
  c.body AS chunk_body,
  d.metadata_json AS doc_metadata_json
FROM (
  SELECT chunk_id, distance
  FROM rag_vec_chunks
  WHERE embedding MATCH :qvec
  ORDER BY distance
  LIMIT 10
) v
JOIN rag_chunks c ON c.chunk_id = v.chunk_id
JOIN rag_documents d ON d.doc_id = c.doc_id;

2.2 Pattern B (debug): store a query vector in a temporary table

This is useful when you want to run vector queries manually in SQL without MCP support.

CREATE TEMP TABLE tmp_query_vec(qvec BLOB);
-- Insert the query vector (float32 array blob). The insertion is usually done by tooling, not manually.
-- INSERT INTO tmp_query_vec VALUES (X'...');

-- PSEUDOCODE: use tmp_query_vec.qvec as the query embedding
SELECT
  v.chunk_id,
  v.distance
FROM rag_vec_chunks v, tmp_query_vec t
WHERE v.embedding MATCH t.qvec
ORDER BY v.distance
LIMIT 10;

3. Hybrid search examples

Hybrid retrieval is best implemented in the MCP layer because it mixes ranking systems and needs careful bounding. However, you can approximate hybrid behavior using SQL to validate logic.

3.1 Hybrid Mode A: Parallel FTS + Vector then fuse (RRF)

Step 1: FTS top 50 (ranked)

WITH fts AS (
  SELECT
    f.chunk_id,
    bm25(rag_fts_chunks) AS score_fts_raw
  FROM rag_fts_chunks f
  WHERE rag_fts_chunks MATCH :fts_query
  ORDER BY score_fts_raw
  LIMIT 50
)
SELECT * FROM fts;

Step 2: Vector top 50 (ranked)

WITH vec AS (
  SELECT
    v.chunk_id,
    v.distance AS distance_raw
  FROM rag_vec_chunks v
  WHERE v.embedding MATCH :qvec
  ORDER BY v.distance
  LIMIT 50
)
SELECT * FROM vec;

Step 3: Fuse via Reciprocal Rank Fusion (RRF)

In SQL you need ranks. SQLite supports window functions in modern builds.

WITH
fts AS (
  SELECT
    f.chunk_id,
    bm25(rag_fts_chunks) AS score_fts_raw,
    ROW_NUMBER() OVER (ORDER BY bm25(rag_fts_chunks)) AS rank_fts
  FROM rag_fts_chunks f
  WHERE rag_fts_chunks MATCH :fts_query
  LIMIT 50
),
vec AS (
  SELECT
    v.chunk_id,
    v.distance AS distance_raw,
    ROW_NUMBER() OVER (ORDER BY v.distance) AS rank_vec
  FROM rag_vec_chunks v
  WHERE v.embedding MATCH :qvec
  LIMIT 50
),
merged AS (
  SELECT
    COALESCE(fts.chunk_id, vec.chunk_id) AS chunk_id,
    fts.rank_fts,
    vec.rank_vec,
    fts.score_fts_raw,
    vec.distance_raw
  FROM fts
  FULL OUTER JOIN vec ON vec.chunk_id = fts.chunk_id
),
rrf AS (
  SELECT
    chunk_id,
    score_fts_raw,
    distance_raw,
    rank_fts,
    rank_vec,
    (1.0 / (60.0 + COALESCE(rank_fts, 1000000))) +
    (1.0 / (60.0 + COALESCE(rank_vec, 1000000))) AS score_rrf
  FROM merged
)
SELECT
  r.chunk_id,
  r.score_rrf,
  c.doc_id,
  c.body AS chunk_body
FROM rrf r
JOIN rag_chunks c ON c.chunk_id = r.chunk_id
ORDER BY r.score_rrf DESC
LIMIT 10;

Important: SQLite does not support FULL OUTER JOIN directly in all builds. For production, implement the merge/fuse in C++ (MCP layer). This SQL is illustrative.

3.2 Hybrid Mode B: Broad FTS then vector rerank (candidate generation)

Step 1: FTS candidate set (top 200)

WITH candidates AS (
  SELECT
    f.chunk_id,
    bm25(rag_fts_chunks) AS score_fts_raw
  FROM rag_fts_chunks f
  WHERE rag_fts_chunks MATCH :fts_query
  ORDER BY score_fts_raw
  LIMIT 200
)
SELECT * FROM candidates;

Step 2: Vector rerank within candidates

Conceptually:

  • Join candidates to rag_vec_chunks and compute distance to :qvec
  • Keep top 10
WITH candidates AS (
  SELECT
    f.chunk_id
  FROM rag_fts_chunks f
  WHERE rag_fts_chunks MATCH :fts_query
  ORDER BY bm25(rag_fts_chunks)
  LIMIT 200
),
reranked AS (
  SELECT
    v.chunk_id,
    v.distance AS distance_raw
  FROM rag_vec_chunks v
  JOIN candidates c ON c.chunk_id = v.chunk_id
  WHERE v.embedding MATCH :qvec
  ORDER BY v.distance
  LIMIT 10
)
SELECT
  r.chunk_id,
  r.distance_raw,
  ch.doc_id,
  ch.body
FROM reranked r
JOIN rag_chunks ch ON ch.chunk_id = r.chunk_id;

As above, the exact MATCH :qvec syntax may need adaptation to your sqlite3-vec build; implement vector query execution in C++ and keep SQL as internal glue.


4. Common “application-friendly” queries

4.1 Return doc_id + score + title only (no bodies)

SELECT
  f.chunk_id,
  c.doc_id,
  COALESCE(c.title, d.title) AS title,
  bm25(rag_fts_chunks) AS score_fts_raw
FROM rag_fts_chunks f
JOIN rag_chunks c ON c.chunk_id = f.chunk_id
JOIN rag_documents d ON d.doc_id = c.doc_id
WHERE rag_fts_chunks MATCH :q
ORDER BY score_fts_raw
LIMIT 20;

4.2 Return top doc_ids (deduplicate by doc_id)

WITH ranked_chunks AS (
  SELECT
    c.doc_id,
    bm25(rag_fts_chunks) AS score_fts_raw
  FROM rag_fts_chunks f
  JOIN rag_chunks c ON c.chunk_id = f.chunk_id
  WHERE rag_fts_chunks MATCH :q
  ORDER BY score_fts_raw
  LIMIT 200
)
SELECT doc_id, MIN(score_fts_raw) AS best_score
FROM ranked_chunks
GROUP BY doc_id
ORDER BY best_score
LIMIT 20;

5. Practical guidance

  • Use SQL mode mainly for debugging and internal tooling.
  • Prefer MCP tools for agent interaction:
    • stable schemas
    • strong guardrails
    • consistent hybrid scoring
  • Implement hybrid fusion in C++ (not in SQL) to avoid dialect limitations and to keep scoring correct.