mirror of https://github.com/sysown/proxysql
You can not select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
349 lines
8.1 KiB
349 lines
8.1 KiB
# ProxySQL RAG Index — SQL Examples (FTS, Vectors, Hybrid)
|
|
|
|
This file provides concrete SQL examples for querying the ProxySQL-hosted SQLite RAG index directly (for debugging, internal dashboards, or SQL-native applications).
|
|
|
|
The **preferred interface for AI agents** remains MCP tools (`mcp-tools.md`). SQL access should typically be restricted to trusted callers.
|
|
|
|
Assumed tables:
|
|
- `rag_documents`
|
|
- `rag_chunks`
|
|
- `rag_fts_chunks` (FTS5)
|
|
- `rag_vec_chunks` (sqlite3-vec vec0 table)
|
|
|
|
---
|
|
|
|
## 0. Common joins and inspection
|
|
|
|
### 0.1 Inspect one document and its chunks
|
|
```sql
|
|
SELECT * FROM rag_documents WHERE doc_id = 'posts:12345';
|
|
SELECT * FROM rag_chunks WHERE doc_id = 'posts:12345' ORDER BY chunk_index;
|
|
```
|
|
|
|
### 0.2 Use the convenience view (if enabled)
|
|
```sql
|
|
SELECT * FROM rag_chunk_view WHERE doc_id = 'posts:12345' ORDER BY chunk_id;
|
|
```
|
|
|
|
---
|
|
|
|
## 1. FTS5 examples
|
|
|
|
### 1.1 Basic FTS search (top 10)
|
|
```sql
|
|
SELECT
|
|
f.chunk_id,
|
|
bm25(rag_fts_chunks) AS score_fts_raw
|
|
FROM rag_fts_chunks f
|
|
WHERE rag_fts_chunks MATCH 'json_extract mysql'
|
|
ORDER BY score_fts_raw
|
|
LIMIT 10;
|
|
```
|
|
|
|
### 1.2 Join FTS results to chunk text and document metadata
|
|
```sql
|
|
SELECT
|
|
f.chunk_id,
|
|
bm25(rag_fts_chunks) AS score_fts_raw,
|
|
c.doc_id,
|
|
COALESCE(c.title, d.title) AS title,
|
|
c.body AS chunk_body,
|
|
d.metadata_json AS doc_metadata_json
|
|
FROM rag_fts_chunks f
|
|
JOIN rag_chunks c ON c.chunk_id = f.chunk_id
|
|
JOIN rag_documents d ON d.doc_id = c.doc_id
|
|
WHERE rag_fts_chunks MATCH 'json_extract mysql'
|
|
AND c.deleted = 0 AND d.deleted = 0
|
|
ORDER BY score_fts_raw
|
|
LIMIT 10;
|
|
```
|
|
|
|
### 1.3 Apply a source filter (by source_id)
|
|
```sql
|
|
SELECT
|
|
f.chunk_id,
|
|
bm25(rag_fts_chunks) AS score_fts_raw
|
|
FROM rag_fts_chunks f
|
|
JOIN rag_chunks c ON c.chunk_id = f.chunk_id
|
|
WHERE rag_fts_chunks MATCH 'replication lag'
|
|
AND c.source_id = 1
|
|
ORDER BY score_fts_raw
|
|
LIMIT 20;
|
|
```
|
|
|
|
### 1.4 Phrase queries, boolean operators (FTS5)
|
|
```sql
|
|
-- phrase
|
|
SELECT chunk_id FROM rag_fts_chunks
|
|
WHERE rag_fts_chunks MATCH '"group replication"'
|
|
LIMIT 20;
|
|
|
|
-- boolean: term1 AND term2
|
|
SELECT chunk_id FROM rag_fts_chunks
|
|
WHERE rag_fts_chunks MATCH 'mysql AND deadlock'
|
|
LIMIT 20;
|
|
|
|
-- boolean: term1 NOT term2
|
|
SELECT chunk_id FROM rag_fts_chunks
|
|
WHERE rag_fts_chunks MATCH 'mysql NOT mariadb'
|
|
LIMIT 20;
|
|
```
|
|
|
|
---
|
|
|
|
## 2. Vector search examples (sqlite3-vec)
|
|
|
|
Vector SQL varies slightly depending on sqlite3-vec build and how you bind vectors.
|
|
Below are **two patterns** you can implement in ProxySQL.
|
|
|
|
### 2.1 Pattern A (recommended): ProxySQL computes embeddings; SQL receives a bound vector
|
|
In this pattern, ProxySQL:
|
|
1) Computes the query embedding in C++
|
|
2) Executes SQL with a bound parameter `:qvec` representing the embedding
|
|
|
|
A typical “nearest neighbors” query shape is:
|
|
|
|
```sql
|
|
-- PSEUDOCODE: adapt to sqlite3-vec's exact operator/function in your build.
|
|
SELECT
|
|
v.chunk_id,
|
|
v.distance AS distance_raw
|
|
FROM rag_vec_chunks v
|
|
WHERE v.embedding MATCH :qvec
|
|
ORDER BY distance_raw
|
|
LIMIT 10;
|
|
```
|
|
|
|
Then join to chunks:
|
|
```sql
|
|
-- PSEUDOCODE: join with content and metadata
|
|
SELECT
|
|
v.chunk_id,
|
|
v.distance AS distance_raw,
|
|
c.doc_id,
|
|
c.body AS chunk_body,
|
|
d.metadata_json AS doc_metadata_json
|
|
FROM (
|
|
SELECT chunk_id, distance
|
|
FROM rag_vec_chunks
|
|
WHERE embedding MATCH :qvec
|
|
ORDER BY distance
|
|
LIMIT 10
|
|
) v
|
|
JOIN rag_chunks c ON c.chunk_id = v.chunk_id
|
|
JOIN rag_documents d ON d.doc_id = c.doc_id;
|
|
```
|
|
|
|
### 2.2 Pattern B (debug): store a query vector in a temporary table
|
|
This is useful when you want to run vector queries manually in SQL without MCP support.
|
|
|
|
```sql
|
|
CREATE TEMP TABLE tmp_query_vec(qvec BLOB);
|
|
-- Insert the query vector (float32 array blob). The insertion is usually done by tooling, not manually.
|
|
-- INSERT INTO tmp_query_vec VALUES (X'...');
|
|
|
|
-- PSEUDOCODE: use tmp_query_vec.qvec as the query embedding
|
|
SELECT
|
|
v.chunk_id,
|
|
v.distance
|
|
FROM rag_vec_chunks v, tmp_query_vec t
|
|
WHERE v.embedding MATCH t.qvec
|
|
ORDER BY v.distance
|
|
LIMIT 10;
|
|
```
|
|
|
|
---
|
|
|
|
## 3. Hybrid search examples
|
|
|
|
Hybrid retrieval is best implemented in the MCP layer because it mixes ranking systems and needs careful bounding.
|
|
However, you can approximate hybrid behavior using SQL to validate logic.
|
|
|
|
### 3.1 Hybrid Mode A: Parallel FTS + Vector then fuse (RRF)
|
|
|
|
#### Step 1: FTS top 50 (ranked)
|
|
```sql
|
|
WITH fts AS (
|
|
SELECT
|
|
f.chunk_id,
|
|
bm25(rag_fts_chunks) AS score_fts_raw
|
|
FROM rag_fts_chunks f
|
|
WHERE rag_fts_chunks MATCH :fts_query
|
|
ORDER BY score_fts_raw
|
|
LIMIT 50
|
|
)
|
|
SELECT * FROM fts;
|
|
```
|
|
|
|
#### Step 2: Vector top 50 (ranked)
|
|
```sql
|
|
WITH vec AS (
|
|
SELECT
|
|
v.chunk_id,
|
|
v.distance AS distance_raw
|
|
FROM rag_vec_chunks v
|
|
WHERE v.embedding MATCH :qvec
|
|
ORDER BY v.distance
|
|
LIMIT 50
|
|
)
|
|
SELECT * FROM vec;
|
|
```
|
|
|
|
#### Step 3: Fuse via Reciprocal Rank Fusion (RRF)
|
|
In SQL you need ranks. SQLite supports window functions in modern builds.
|
|
|
|
```sql
|
|
WITH
|
|
fts AS (
|
|
SELECT
|
|
f.chunk_id,
|
|
bm25(rag_fts_chunks) AS score_fts_raw,
|
|
ROW_NUMBER() OVER (ORDER BY bm25(rag_fts_chunks)) AS rank_fts
|
|
FROM rag_fts_chunks f
|
|
WHERE rag_fts_chunks MATCH :fts_query
|
|
LIMIT 50
|
|
),
|
|
vec AS (
|
|
SELECT
|
|
v.chunk_id,
|
|
v.distance AS distance_raw,
|
|
ROW_NUMBER() OVER (ORDER BY v.distance) AS rank_vec
|
|
FROM rag_vec_chunks v
|
|
WHERE v.embedding MATCH :qvec
|
|
LIMIT 50
|
|
),
|
|
merged AS (
|
|
SELECT
|
|
COALESCE(fts.chunk_id, vec.chunk_id) AS chunk_id,
|
|
fts.rank_fts,
|
|
vec.rank_vec,
|
|
fts.score_fts_raw,
|
|
vec.distance_raw
|
|
FROM fts
|
|
FULL OUTER JOIN vec ON vec.chunk_id = fts.chunk_id
|
|
),
|
|
rrf AS (
|
|
SELECT
|
|
chunk_id,
|
|
score_fts_raw,
|
|
distance_raw,
|
|
rank_fts,
|
|
rank_vec,
|
|
(1.0 / (60.0 + COALESCE(rank_fts, 1000000))) +
|
|
(1.0 / (60.0 + COALESCE(rank_vec, 1000000))) AS score_rrf
|
|
FROM merged
|
|
)
|
|
SELECT
|
|
r.chunk_id,
|
|
r.score_rrf,
|
|
c.doc_id,
|
|
c.body AS chunk_body
|
|
FROM rrf r
|
|
JOIN rag_chunks c ON c.chunk_id = r.chunk_id
|
|
ORDER BY r.score_rrf DESC
|
|
LIMIT 10;
|
|
```
|
|
|
|
**Important**: SQLite does not support `FULL OUTER JOIN` directly in all builds.
|
|
For production, implement the merge/fuse in C++ (MCP layer). This SQL is illustrative.
|
|
|
|
### 3.2 Hybrid Mode B: Broad FTS then vector rerank (candidate generation)
|
|
|
|
#### Step 1: FTS candidate set (top 200)
|
|
```sql
|
|
WITH candidates AS (
|
|
SELECT
|
|
f.chunk_id,
|
|
bm25(rag_fts_chunks) AS score_fts_raw
|
|
FROM rag_fts_chunks f
|
|
WHERE rag_fts_chunks MATCH :fts_query
|
|
ORDER BY score_fts_raw
|
|
LIMIT 200
|
|
)
|
|
SELECT * FROM candidates;
|
|
```
|
|
|
|
#### Step 2: Vector rerank within candidates
|
|
Conceptually:
|
|
- Join candidates to `rag_vec_chunks` and compute distance to `:qvec`
|
|
- Keep top 10
|
|
|
|
```sql
|
|
WITH candidates AS (
|
|
SELECT
|
|
f.chunk_id
|
|
FROM rag_fts_chunks f
|
|
WHERE rag_fts_chunks MATCH :fts_query
|
|
ORDER BY bm25(rag_fts_chunks)
|
|
LIMIT 200
|
|
),
|
|
reranked AS (
|
|
SELECT
|
|
v.chunk_id,
|
|
v.distance AS distance_raw
|
|
FROM rag_vec_chunks v
|
|
JOIN candidates c ON c.chunk_id = v.chunk_id
|
|
WHERE v.embedding MATCH :qvec
|
|
ORDER BY v.distance
|
|
LIMIT 10
|
|
)
|
|
SELECT
|
|
r.chunk_id,
|
|
r.distance_raw,
|
|
ch.doc_id,
|
|
ch.body
|
|
FROM reranked r
|
|
JOIN rag_chunks ch ON ch.chunk_id = r.chunk_id;
|
|
```
|
|
|
|
As above, the exact `MATCH :qvec` syntax may need adaptation to your sqlite3-vec build; implement vector query execution in C++ and keep SQL as internal glue.
|
|
|
|
---
|
|
|
|
## 4. Common “application-friendly” queries
|
|
|
|
### 4.1 Return doc_id + score + title only (no bodies)
|
|
```sql
|
|
SELECT
|
|
f.chunk_id,
|
|
c.doc_id,
|
|
COALESCE(c.title, d.title) AS title,
|
|
bm25(rag_fts_chunks) AS score_fts_raw
|
|
FROM rag_fts_chunks f
|
|
JOIN rag_chunks c ON c.chunk_id = f.chunk_id
|
|
JOIN rag_documents d ON d.doc_id = c.doc_id
|
|
WHERE rag_fts_chunks MATCH :q
|
|
ORDER BY score_fts_raw
|
|
LIMIT 20;
|
|
```
|
|
|
|
### 4.2 Return top doc_ids (deduplicate by doc_id)
|
|
```sql
|
|
WITH ranked_chunks AS (
|
|
SELECT
|
|
c.doc_id,
|
|
bm25(rag_fts_chunks) AS score_fts_raw
|
|
FROM rag_fts_chunks f
|
|
JOIN rag_chunks c ON c.chunk_id = f.chunk_id
|
|
WHERE rag_fts_chunks MATCH :q
|
|
ORDER BY score_fts_raw
|
|
LIMIT 200
|
|
)
|
|
SELECT doc_id, MIN(score_fts_raw) AS best_score
|
|
FROM ranked_chunks
|
|
GROUP BY doc_id
|
|
ORDER BY best_score
|
|
LIMIT 20;
|
|
```
|
|
|
|
---
|
|
|
|
## 5. Practical guidance
|
|
|
|
- Use SQL mode mainly for debugging and internal tooling.
|
|
- Prefer MCP tools for agent interaction:
|
|
- stable schemas
|
|
- strong guardrails
|
|
- consistent hybrid scoring
|
|
- Implement hybrid fusion in C++ (not in SQL) to avoid dialect limitations and to keep scoring correct.
|