# ProxySQL RAG Index — Data Model & Ingestion Architecture (v0 Blueprint)
This document explains the SQLite data model used to turn relational tables (e.g. MySQL `posts`) into a retrieval-friendly index hosted inside ProxySQL. It focuses on:
- What each SQLite table does
- How tables relate to each other
- How `rag_sources` defines **explicit mapping rules** (no guessing)
- How ingestion transforms rows into documents and chunks
- How FTS and vector indexes are maintained
- What evolves later for incremental sync and updates
---
## 1. Goal and core idea
Relational databases are excellent for structured queries, but RAG-style retrieval needs:
- Fast keyword search (error messages, identifiers, tags)
- Fast semantic search (similar meaning, paraphrased questions)
- A stable way to “refetch the authoritative data” from the source DB
The model below implements a **canonical document layer** inside ProxySQL:
1. Ingest selected rows from a source database (MySQL, PostgreSQL, etc.)
2. Convert each row into a **document** (title/body + metadata)
3. Split long bodies into **chunks**
4. Index chunks in:
- **FTS5** for keyword search
- **sqlite3-vec** for vector similarity
5. Serve retrieval through stable APIs (MCP or SQL), independent of where indexes physically live in the future
---
## 2. The SQLite tables (what they are and why they exist)
### 2.1 `rag_sources` — control plane: “what to ingest and how”
**Purpose**
- Defines each ingestion source (a table or view in an external DB)
- Stores *explicit* transformation rules:
- which columns become `title`, `body`
- which columns go into `metadata_json`
- how to build `doc_id`
- Stores chunking strategy and embedding strategy configuration
**Key columns**
- `backend_*`: how to connect (v0 connects directly; later may be “via ProxySQL”)
- `table_name`, `pk_column`: what to ingest
- `where_sql`: optional restriction (e.g. only questions)
- `doc_map_json`: mapping rules (required)
- `chunking_json`: chunking rules (required)
- `embedding_json`: embedding rules (optional)
**Important**: `rag_sources` is the **only place** that defines mapping logic.
A general-purpose ingester must never “guess” which fields belong to `body` or metadata.
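As a concrete illustration, a minimal `rag_sources` DDL could look like the sketch below. Only `backend_*`, `table_name`, `pk_column`, `where_sql`, and the three `*_json` columns come from this document; all other names, types, and defaults are assumptions:

```python
import sqlite3

# Hedged sketch of the control-plane table; exact types/defaults are assumed.
DDL_RAG_SOURCES = """
CREATE TABLE IF NOT EXISTS rag_sources (
    source_id      INTEGER PRIMARY KEY,
    backend_host   TEXT NOT NULL,      -- backend_*: how to connect (v0: direct)
    backend_port   INTEGER NOT NULL,
    backend_user   TEXT,
    backend_schema TEXT,
    table_name     TEXT NOT NULL,      -- what to ingest
    pk_column      TEXT NOT NULL,
    where_sql      TEXT,               -- optional restriction
    doc_map_json   TEXT NOT NULL,      -- mapping rules (required)
    chunking_json  TEXT NOT NULL,      -- chunking rules (required)
    embedding_json TEXT,               -- embedding rules (optional)
    enabled        INTEGER NOT NULL DEFAULT 1
);
"""

conn = sqlite3.connect(":memory:")
conn.execute(DDL_RAG_SOURCES)
```

Keeping all policy in one row per source is what lets the ingester itself stay generic.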
---
### 2.2 `rag_documents` — canonical documents: “one per source row”
**Purpose**
- Represents the canonical document created from a single source row.
- Stores:
- a stable identifier (`doc_id`)
- a refetch pointer (`pk_json`)
- document text (`title`, `body`)
- structured metadata (`metadata_json`)
**Why store full `body` here?**
- Enables re-chunking later without re-fetching from the source DB.
- Makes debugging and inspection easier.
- Supports future update detection and diffing.
**Key columns**
- `doc_id` (PK): stable across runs and machines (e.g. `"posts:12345"`)
- `source_id`: ties back to `rag_sources`
- `pk_json`: how to refetch the authoritative row later (e.g. `{"Id":12345}`)
- `title`, `body`: canonical text
- `metadata_json`: non-text signals used for filters/boosting
- `updated_at`, `deleted`: lifecycle fields for incremental sync later
---
### 2.3 `rag_chunks` — retrieval units: “one or many per document”
**Purpose**
- Stores chunked versions of a document's text.
- Retrieval and embeddings are performed at the chunk level for better quality.
**Why chunk at all?**
- Long bodies reduce retrieval quality:
- FTS returns large documents where only a small part is relevant
- Vector embeddings of large texts smear multiple topics together
- Chunking yields:
- better precision
- better citations (“this chunk”) and smaller context
- cheaper updates (only re-embed changed chunks later)
**Key columns**
- `chunk_id` (PK): stable, derived from doc_id + chunk index (e.g. `"posts:12345#0"`)
- `doc_id` (FK): parent document
- `source_id`: convenience for filtering without joining documents
- `chunk_index`: 0..N-1
- `title`, `body`: chunk text (often title repeated for context)
- `metadata_json`: optional chunk-level metadata (offsets, “has_code”, section label)
- `updated_at`, `deleted`: lifecycle for later incremental sync
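The two tables above can be sketched together; the column list follows the key columns named in sections 2.2 and 2.3, while exact types and constraints are assumptions:

```python
import sqlite3, json

conn = sqlite3.connect(":memory:")
# Hedged sketch: types, NOT NULL choices, and the FK are illustrative.
conn.executescript("""
CREATE TABLE rag_documents (
    doc_id        TEXT PRIMARY KEY,     -- e.g. "posts:12345"
    source_id     INTEGER NOT NULL,
    pk_json       TEXT NOT NULL,        -- e.g. '{"Id":12345}'
    title         TEXT,
    body          TEXT,
    metadata_json TEXT,
    updated_at    TEXT,
    deleted       INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE rag_chunks (
    chunk_id      TEXT PRIMARY KEY,     -- doc_id + "#" + chunk_index
    doc_id        TEXT NOT NULL REFERENCES rag_documents(doc_id),
    source_id     INTEGER NOT NULL,     -- convenience: filter without a join
    chunk_index   INTEGER NOT NULL,
    title         TEXT,
    body          TEXT,
    metadata_json TEXT,
    updated_at    TEXT,
    deleted       INTEGER NOT NULL DEFAULT 0
);
""")

doc_id = "posts:12345"
conn.execute(
    "INSERT INTO rag_documents (doc_id, source_id, pk_json, title, body) VALUES (?,?,?,?,?)",
    (doc_id, 1, json.dumps({"Id": 12345}), "How to parse JSON in MySQL 8?", "<p>...</p>"),
)
conn.execute(
    "INSERT INTO rag_chunks (chunk_id, doc_id, source_id, chunk_index, title, body) VALUES (?,?,?,?,?,?)",
    (f"{doc_id}#0", doc_id, 1, 0, "How to parse JSON in MySQL 8?", "<p>...</p>"),
)
```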
---
### 2.4 `rag_fts_chunks` — FTS5 index (contentless)
**Purpose**
- Keyword search index for chunks.
- Best for:
- exact terms
- identifiers
- error messages
- tags and code tokens (depending on tokenization)
**Design choice: contentless FTS**
- The FTS virtual table does not automatically mirror `rag_chunks`.
- The ingester explicitly inserts into FTS as chunks are created.
- This makes ingestion deterministic and avoids surprises when chunk bodies change later.
**Stored fields**
- `chunk_id` (unindexed, acts like a row identifier)
- `title`, `body` (indexed)
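In practice, "does not mirror `rag_chunks`" likely means a standalone FTS5 table (no `content=` link), where the ingester inserts rows itself. A minimal sketch with illustrative chunk data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Standalone FTS5 table: nothing is mirrored automatically from rag_chunks;
# the ingester inserts each chunk explicitly as it is created.
conn.execute(
    "CREATE VIRTUAL TABLE rag_fts_chunks USING fts5(chunk_id UNINDEXED, title, body)"
)
conn.executemany(
    "INSERT INTO rag_fts_chunks (chunk_id, title, body) VALUES (?, ?, ?)",
    [
        ("posts:12345#0", "How to parse JSON in MySQL 8?",
         "I tried JSON_EXTRACT on a JSON column..."),
        ("posts:67890#0", "Index tuning",
         "Covering indexes reduce secondary lookups..."),
    ],
)
# Keyword search over all indexed columns, best match first.
hits = [r[0] for r in conn.execute(
    "SELECT chunk_id FROM rag_fts_chunks WHERE rag_fts_chunks MATCH ? ORDER BY rank",
    ("mysql",),
)]
```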
---
### 2.5 `rag_vec_chunks` — vector index (sqlite3-vec)
**Purpose**
- Semantic similarity search over chunks.
- Each chunk has a vector embedding.
**Key columns**
- `embedding float[DIM]`: embedding vector (DIM must match your model)
- `chunk_id`: join key to `rag_chunks`
- Optional metadata columns:
- `doc_id`, `source_id`, `updated_at`
- These make filtering and joining cheaper and help query performance.
**Note**
- The ingester decides what text is embedded (chunk body alone, or “Title + Tags + Body chunk”).
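The sqlite3-vec virtual-table API is not shown here; as a hedged stand-in, the same join-by-`chunk_id` pattern can be mimicked with a plain table of packed float BLOBs and brute-force cosine similarity (all names and the fallback itself are illustrative, not the project's actual index):

```python
import sqlite3, struct, math

def pack(vec):
    # Pack a float list into a little-endian float32 BLOB.
    return struct.pack(f"{len(vec)}f", *vec)

def unpack(blob):
    return struct.unpack(f"{len(blob) // 4}f", blob)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

conn = sqlite3.connect(":memory:")
# Stand-in for rag_vec_chunks: chunk_id join key + embedding + optional metadata.
conn.execute(
    "CREATE TABLE rag_vec_chunks (chunk_id TEXT PRIMARY KEY, embedding BLOB, "
    "doc_id TEXT, source_id INT)"
)
conn.executemany(
    "INSERT INTO rag_vec_chunks VALUES (?,?,?,?)",
    [
        ("posts:1#0", pack([1.0, 0.0, 0.0]), "posts:1", 1),
        ("posts:2#0", pack([0.0, 1.0, 0.0]), "posts:2", 1),
    ],
)

def nearest(conn, query_vec, k=1):
    # Brute-force scan; a real vector index does this far more efficiently.
    rows = conn.execute("SELECT chunk_id, embedding FROM rag_vec_chunks").fetchall()
    rows.sort(key=lambda r: -cosine(query_vec, unpack(r[1])))
    return [r[0] for r in rows[:k]]
```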
---
### 2.6 Optional convenience objects
- `rag_chunk_view`: joins `rag_chunks` with `rag_documents` for debugging/inspection
- `rag_sync_state`: reserved for incremental sync later (not used in v0)
---
## 3. Table relationships (the graph)
Think of this as a data pipeline graph:
```text
rag_sources
  (defines mapping + chunking + embedding)
        |
        v
rag_documents        (1 row per source row)
        |
        v
rag_chunks           (1..N chunks per document)
        /        \
       v          v
rag_fts_chunks   rag_vec_chunks
```
**Cardinality**
- `rag_sources (1) -> rag_documents (N)`
- `rag_documents (1) -> rag_chunks (N)`
- `rag_chunks (1) -> rag_fts_chunks (1)` (insertion done by ingester)
- `rag_chunks (1) -> rag_vec_chunks (0/1+)` (0 if embeddings disabled; 1 typically)
---
## 4. How mapping is defined (no guessing)
### 4.1 Why `doc_map_json` exists
A general-purpose system cannot infer that:
- `posts.Body` should become document body
- `posts.Title` should become title
- `Score`, `Tags`, `CreationDate`, etc. should become metadata
- Or how to concatenate fields
Therefore, `doc_map_json` is required.
### 4.2 `doc_map_json` structure (v0)
`doc_map_json` defines:
- `doc_id.format`: string template with `{ColumnName}` placeholders
- `title.concat`: concatenation spec
- `body.concat`: concatenation spec
- `metadata.pick`: list of column names to include in metadata JSON
- `metadata.rename`: mapping of old key -> new key (useful for typos or schema differences)
**Concatenation parts**
- `{"col":"Column"}` — appends the column value (if present)
- `{"lit":"..."}` — appends a literal string
Example (posts-like):
```json
{
  "doc_id":   { "format": "posts:{Id}" },
  "title":    { "concat": [ { "col": "Title" } ] },
  "body":     { "concat": [ { "col": "Body" } ] },
  "metadata": {
    "pick":   ["Id", "PostTypeId", "Tags", "Score", "CreaionDate"],
    "rename": { "CreaionDate": "CreationDate" }
  }
}
```

(The misspelled `CreaionDate` is deliberate: it shows `rename` fixing a source-side typo.)
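The mapping rules can be applied with a small interpreter; this is a minimal sketch (function names and the sample row are illustrative), using the example spec above:

```python
import json

def concat_parts(parts, row):
    # {"col": name} appends the column value (if present and non-null);
    # {"lit": s} appends a literal string.
    out = []
    for part in parts:
        if "col" in part:
            val = row.get(part["col"])
            if val is not None:
                out.append(str(val))
        elif "lit" in part:
            out.append(part["lit"])
    return "".join(out)

def apply_doc_map(doc_map, row):
    # doc_id.format uses {ColumnName} placeholders.
    doc_id = doc_map["doc_id"]["format"].format(**row)
    title = concat_parts(doc_map["title"]["concat"], row)
    body = concat_parts(doc_map["body"]["concat"], row)
    meta_spec = doc_map.get("metadata", {})
    rename = meta_spec.get("rename", {})
    metadata = {}
    for col in meta_spec.get("pick", []):
        if col in row:
            metadata[rename.get(col, col)] = row[col]
    return doc_id, title, body, json.dumps(metadata)

doc_map = {
    "doc_id": {"format": "posts:{Id}"},
    "title": {"concat": [{"col": "Title"}]},
    "body": {"concat": [{"col": "Body"}]},
    "metadata": {
        "pick": ["Id", "PostTypeId", "Tags", "Score", "CreaionDate"],
        "rename": {"CreaionDate": "CreationDate"},
    },
}
row = {
    "Id": 12345, "PostTypeId": 1,
    "Title": "How to parse JSON in MySQL 8?",
    "Body": "<p>I tried JSON_EXTRACT...</p>",
    "Tags": "<mysql><json>", "Score": 12,
    "CreaionDate": "2010-08-05T12:00:00",
}
doc_id, title, body, metadata_json = apply_doc_map(doc_map, row)
```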
---
## 5. Chunking strategy definition
### 5.1 Why chunking is configured per source
Different tables need different chunking:
- StackOverflow `Body` may be long -> chunking recommended
- Small “reference” tables may not need chunking at all
Thus chunking is stored in `rag_sources.chunking_json`.
### 5.2 `chunking_json` structure (v0)
v0 supports **chars-based** chunking (simple, robust).
```json
{
  "enabled": true,
  "unit": "chars",
  "chunk_size": 4000,
  "overlap": 400,
  "min_chunk_size": 800
}
```
**Behavior**
- If `body.length <= chunk_size` -> one chunk
- Else chunks of `chunk_size` with `overlap`
- Avoid tiny final chunks by appending the tail to the previous chunk if below `min_chunk_size`
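The behavior above can be sketched as a small function (the tail-merge detail, trimming the overlap already present in the previous chunk, is an assumption about v0's exact semantics):

```python
def chunk_text(body, chunk_size=4000, overlap=400, min_chunk_size=800):
    # Single chunk if the body fits.
    if len(body) <= chunk_size:
        return [body]
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(body):
        chunks.append(body[start:start + chunk_size])
        start += step
    # Avoid a tiny final chunk: fold it into the previous one,
    # skipping the overlap the previous chunk already contains.
    if len(chunks) > 1 and len(chunks[-1]) < min_chunk_size:
        tail = chunks.pop()
        chunks[-1] = chunks[-1] + tail[overlap:]
    return chunks
```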
**Why overlap matters**
- Prevents splitting a key sentence or code snippet across boundaries
- Improves both FTS and semantic retrieval consistency
---
## 6. Embedding strategy definition (where it fits in the model)
### 6.1 Why embeddings are per chunk
- Better retrieval precision
- Smaller context per match
- Allows partial updates later (only re-embed changed chunks)
### 6.2 `embedding_json` structure (v0)
```json
{
  "enabled": true,
  "dim": 1536,
  "model": "text-embedding-3-large",
  "input": { "concat": [
    { "col": "Title" },
    { "lit": "\nTags: " }, { "col": "Tags" },
    { "lit": "\n\n" },
    { "chunk_body": true }
  ]}
}
```
**Meaning**
- Build embedding input text from:
- title
- tags (as plain text)
- chunk body
This improves semantic retrieval for question-like content without embedding numeric metadata.
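Assembling the embedding input follows the same interpreter pattern as `doc_map_json`, plus the `{"chunk_body": true}` part. A minimal sketch (the sample row and chunk text are illustrative):

```python
def build_embedding_input(spec, row, chunk_body):
    # Walk the input.concat parts: {"chunk_body": true} splices in the chunk
    # text; {"col"} and {"lit"} behave exactly as in doc_map_json concat.
    out = []
    for part in spec["input"]["concat"]:
        if part.get("chunk_body"):
            out.append(chunk_body)
        elif "col" in part:
            val = row.get(part["col"])
            if val is not None:
                out.append(str(val))
        elif "lit" in part:
            out.append(part["lit"])
    return "".join(out)

spec = {
    "enabled": True, "dim": 1536, "model": "text-embedding-3-large",
    "input": {"concat": [
        {"col": "Title"},
        {"lit": "\nTags: "}, {"col": "Tags"},
        {"lit": "\n\n"},
        {"chunk_body": True},
    ]},
}
row = {"Title": "How to parse JSON in MySQL 8?", "Tags": "<mysql><json>"}
text = build_embedding_input(spec, row, "I tried JSON_EXTRACT on a JSON column.")
```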
---
## 7. Ingestion lifecycle (step-by-step)
For each enabled `rag_sources` entry:
1. **Connect** to source DB using `backend_*`
2. **Select rows** from `table_name` (and optional `where_sql`)
- Select only needed columns determined by `doc_map_json` and `embedding_json`
3. For each row:
- Build `doc_id` using `doc_map_json.doc_id.format`
- Build `pk_json` from `pk_column`
- Build `title` using `title.concat`
- Build `body` using `body.concat`
- Build `metadata_json` using `metadata.pick` and `metadata.rename`
4. **Skip** if `doc_id` already exists (v0 behavior)
5. Insert into `rag_documents`
6. Chunk `body` using `chunking_json`
7. For each chunk:
- Insert into `rag_chunks`
- Insert into `rag_fts_chunks`
- If embeddings enabled:
- Build embedding input text using `embedding_json.input`
- Compute embedding
- Insert into `rag_vec_chunks`
8. Commit (ideally in a transaction for performance)
---
## 8. What changes later (incremental sync and updates)
v0 is “insert-only and skip-existing.”
Product-grade ingestion requires:
### 8.1 Detecting changes
Options:
- Watermark by `LastActivityDate` / `updated_at` column
- Hash (e.g. `sha256(title||body||metadata)`) stored in documents table
- Compare chunk hashes to re-embed only changed chunks
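The hash option could look like the sketch below; the field separator is an assumption added so that shifting text between fields cannot produce the same hash:

```python
import hashlib

def doc_content_hash(title, body, metadata_json):
    # sha256 over title || body || metadata, with a NUL separator so that
    # ("ab", "c") and ("a", "bc") hash differently.
    h = hashlib.sha256()
    for part in (title or "", body or "", metadata_json or ""):
        h.update(part.encode("utf-8"))
        h.update(b"\x00")
    return h.hexdigest()
```

Storing this per document (and per chunk) lets the ingester re-chunk and re-embed only what actually changed.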
### 8.2 Updating and deleting
Needs:
- Upsert documents
- Delete or mark `deleted=1` when source row deleted
- Rebuild chunks and indexes when body changes
- Maintain FTS rows:
- delete old chunk rows from FTS
- insert updated chunk rows
### 8.3 Checkpoints
Use `rag_sync_state` to store:
- last ingested timestamp
- GTID/LSN for CDC
- or a monotonic PK watermark
The current schema already includes:
- `updated_at` and `deleted`
- `rag_sync_state` placeholder
So incremental sync can be added without breaking the data model.
---
## 9. Practical example: mapping `posts` table
Given a MySQL `posts` row:
- `Id = 12345`
- `Title = "How to parse JSON in MySQL 8?"`
- `Body = "<p>I tried JSON_EXTRACT...</p>"`
- `Tags = "<mysql><json>"`
- `Score = 12`
With mapping:
- `doc_id = "posts:12345"`
- `title = Title`
- `body = Body`
- `metadata_json` includes `{ "Tags": "...", "Score": "12", ... }`
- chunking splits body into:
- `posts:12345#0`, `posts:12345#1`, etc.
- FTS is populated with the chunk text
- vectors are stored per chunk
---
## 10. Summary
This data model separates concerns cleanly:
- `rag_sources` defines *policy* (what/how to ingest)
- `rag_documents` defines canonical *identity and refetch pointer*
- `rag_chunks` defines retrieval *units*
- `rag_fts_chunks` defines keyword search
- `rag_vec_chunks` defines semantic search
This separation makes the system:
- general purpose (works for many schemas)
- deterministic (no magic inference)
- extensible to incremental sync, external indexes, and richer hybrid retrieval