# RAG Ingestion Tool - Usage Guide (MySQL Protocol Version)

## Overview

`rag_ingest` reads data from MySQL, transforms it, chunks documents, builds full-text search indexes, and optionally generates vector embeddings for semantic search.

**This version connects via MySQL protocol to a server backed by SQLite** (ProxySQL SQLite3 Server on port 6030). All SQLite queries, FTS5, and vec0 extensions work transparently through the gateway.

---

## Quick Start

```bash
# 1. Build the tool (from repository root)
cd RAG_POC
make

# 2. Initialize the RAG database schema
./rag_ingest init --host=127.0.0.1 --port=6030 --user=root --password=root --database=rag_db

# 3. Configure your data source (via MySQL protocol)
mysql -h 127.0.0.1 -P 6030 -u root -proot rag_db < setup_source.sql

# 4. Run ingestion
./rag_ingest ingest --host=127.0.0.1 --port=6030 --user=root --password=root --database=rag_db

# 5. For detailed logging (optional)
./rag_ingest ingest --log-level=debug -h 127.0.0.1 -P 6030 -u root -p root -D rag_db
```

---

## Step-by-Step Guide

### Step 1: Initialize the RAG Database

```bash
# Using MySQL-style long options
./rag_ingest init \
  --host=127.0.0.1 \
  --port=6030 \
  --user=root \
  --password=root \
  --database=rag_index

# Using short options
./rag_ingest init -h 127.0.0.1 -P 6030 -u root -p root -D rag_index

# Using defaults (host=127.0.0.1, port=6030)
./rag_ingest init -u root -p root -D rag_index
```

**What happens:**
- Connects to the SQLite3 Server via MySQL protocol (default: 127.0.0.1:6030)
- Creates the RAG schema tables if they don't exist
- Creates FTS5 full-text search indexes
- Creates vec0 vector similarity search indexes
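
Because the gateway passes SQLite statements through unchanged (see Overview), one quick way to verify the result is to query the SQLite catalog. A minimal sketch; the `rag_%` filter assumes all RAG tables share that prefix, as the tables referenced in this guide do:

```sql
-- List the RAG tables created by `init`
SELECT name FROM sqlite_master WHERE type = 'table' AND name LIKE 'rag_%';
```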

### Step 2: Configure Your Data Source

Insert a source configuration into `rag_sources`:

```sql
-- Minimal configuration (no chunking, no embeddings)
INSERT INTO rag_sources (
    name,
    enabled,
    backend_type,
    backend_host,
    backend_port,
    backend_user,
    backend_pass,
    backend_db,
    table_name,
    pk_column
) VALUES (
    'my_mysql_data',  -- Human-readable name
    1,                -- enabled (1=enabled, 0=disabled)
    'mysql',          -- Backend type (only 'mysql' is supported)
    '127.0.0.1',      -- MySQL host
    3306,             -- MySQL port
    'root',           -- MySQL username
    'mypassword',     -- MySQL password
    'my_database',    -- MySQL database name
    'posts',          -- Table to read from
    'Id'              -- Primary key column
);
```
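
To confirm the row was registered, and to find the `source_id` that later examples reference, read the table back:

```sql
-- The assigned source_id is used by where_sql updates and cleanup queries below
SELECT source_id, name, enabled FROM rag_sources;
```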

### Step 3: Run Ingestion

```bash
./rag_ingest ingest --host=127.0.0.1 --port=6030 --user=root --password=root --database=rag_index
```

**What happens:**
1. Connects to the MySQL backend using the credentials from `rag_sources`
2. Executes `SELECT * FROM posts`
3. For each row:
   - Creates a document in `rag_documents` (via MySQL protocol to the SQLite backend)
   - Creates a chunk in `rag_chunks` (one per document when chunking is disabled)
   - Creates an FTS entry in `rag_fts_chunks` (queryable as sketched below)
4. Updates `rag_sync_state` with the maximum primary key value
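
At this point the full-text index is usable. A minimal smoke test, assuming the FTS5 virtual table indexes the chunk text (the exact indexed columns come from the schema that `init` creates):

```sql
-- FTS5 probe: find up to five chunks matching "replication"
SELECT rowid
FROM rag_fts_chunks
WHERE rag_fts_chunks MATCH 'replication'
LIMIT 5;
```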
---

## Command-Line Options

### Logging

Control log verbosity with `--log-level` (available for all commands):

```bash
--log-level=LEVEL
```

| Level | Output | Use Case |
|-------|--------|----------|
| `error` | Errors only | Production scripts, minimal logging |
| `warn` | Warnings + errors | Detect issues without verbose output |
| `info` | Progress, statistics, key events | Normal operation (**default**) |
| `debug` | Detailed info | SQL queries, configuration values, diagnostics |
| `trace` | Everything | Fine-grained function entry/exit, development |

**Examples:**

```bash
# Minimal output (errors only)
./rag_ingest ingest --log-level=error --host=127.0.0.1 --port=6030 --user=root --password=root --database=rag_db

# Default (info level)
./rag_ingest ingest --host=127.0.0.1 --port=6030 --user=root --password=root --database=rag_db

# Detailed debugging
./rag_ingest ingest --log-level=debug --host=127.0.0.1 --port=6030 --user=root --password=root --database=rag_db

# Maximum verbosity
./rag_ingest ingest --log-level=trace --host=127.0.0.1 --port=6030 --user=root --password=root --database=rag_db
```

**Output Format:**
- Timestamps: `[YYYY-MM-DD HH:MM:SS]`
- Log levels: `[ERROR]`, `[WARN]`, `[INFO]`, `[DEBUG]`, `[TRACE]`
- Color-coded (ANSI colors for terminal output)

### init

Initialize the database schema.

```bash
./rag_ingest init [OPTIONS]

Common Options:
  -h, --host=name       Connect to host (default: 127.0.0.1)
  -P, --port=#          Port number to use (default: 6030)
  -u, --user=name       User for login
  -p, --password=name   Password to use
  -D, --database=name   Database to use (required)
  -?, --help            Show this help message

Logging Options:
  --log-level=LEVEL     Log verbosity: error, warn, info, debug, trace (default: info)

Init Options:
  --vec-dim=#           Vector dimension for the rag_vec_chunks table (default: 1536)
```

### ingest

Run ingestion from the configured sources.

```bash
./rag_ingest ingest [OPTIONS]

Common Options:
  -h, --host=name       Connect to host (default: 127.0.0.1)
  -P, --port=#          Port number to use (default: 6030)
  -u, --user=name       User for login
  -p, --password=name   Password to use
  -D, --database=name   Database to use (required)
  -?, --help            Show this help message

Logging Options:
  --log-level=LEVEL     Log verbosity: error, warn, info, debug, trace (default: info)
```

### query

Vector similarity search using embeddings.

```bash
./rag_ingest query [OPTIONS]

Common Options:
  -h, --host=name       Connect to host (default: 127.0.0.1)
  -P, --port=#          Port number to use (default: 6030)
  -u, --user=name       User for login
  -p, --password=name   Password to use
  -D, --database=name   Database to use (required)
  -?, --help            Show this help message

Logging Options:
  --log-level=LEVEL     Log verbosity: error, warn, info, debug, trace (default: info)

Query Options:
  -t, --text=text       Query text to search for (required)
  -s, --source-id=#     Source ID to search (default: all enabled sources)
  -l, --limit=#         Maximum results to return (default: 5)
```
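
For example, a three-result search against the default gateway, using only the options listed above:

```bash
./rag_ingest query -t "connection pooling" -l 3 -h 127.0.0.1 -P 6030 -u root -p root -D rag_index
```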
---

## Common Configurations

### Configuration 1: Basic Ingestion (No Chunking, No Embeddings)

```sql
INSERT INTO rag_sources (
    name, enabled, backend_type,
    backend_host, backend_port, backend_user, backend_pass, backend_db,
    table_name, pk_column
)
VALUES (
    'basic_source', 1, 'mysql',
    '127.0.0.1', 3306, 'root', 'pass', 'mydb',
    'posts', 'Id'
);

-- chunking_json and embedding_json default to disabled
```

**Result:** One chunk per document, FTS only, no vectors.

---

### Configuration 2: Enable Chunking

Chunking splits long documents into smaller pieces for better retrieval precision.

```sql
INSERT INTO rag_sources (
    name, enabled, backend_type,
    backend_host, backend_port, backend_user, backend_pass, backend_db,
    table_name, pk_column, chunking_json
)
VALUES (
    'chunked_source', 1, 'mysql',
    '127.0.0.1', 3306, 'root', 'pass', 'mydb',
    'posts', 'Id',
    '{
        "enabled": true,
        "unit": "chars",
        "chunk_size": 4000,
        "overlap": 400,
        "min_chunk_size": 800
    }'
);
```

**Result:** Documents are split into ~4000-character chunks with a 400-character overlap.

---

### Configuration 3: Enable Chunking + Embeddings (Stub)

For testing without an external embedding service.

```sql
INSERT INTO rag_sources (
    name, enabled, backend_type,
    backend_host, backend_port, backend_user, backend_pass, backend_db,
    table_name, pk_column, chunking_json, embedding_json
)
VALUES (
    'embedded_source_stub', 1, 'mysql',
    '127.0.0.1', 3306, 'root', 'pass', 'mydb',
    'posts', 'Id',
    '{
        "enabled": true,
        "unit": "chars",
        "chunk_size": 4000,
        "overlap": 400,
        "min_chunk_size": 800
    }',
    '{
        "enabled": true,
        "provider": "stub",
        "dim": 1536
    }'
);
```

**Result:** Pseudo-embeddings are generated instantly (no API call). Good for testing.

---

### Configuration 4: Enable Chunking + Real Embeddings

With an OpenAI-compatible embedding service.

```sql
INSERT INTO rag_sources (
    name, enabled, backend_type,
    backend_host, backend_port, backend_user, backend_pass, backend_db,
    table_name, pk_column, chunking_json, embedding_json
)
VALUES (
    'embedded_source_real', 1, 'mysql',
    '127.0.0.1', 3306, 'root', 'pass', 'mydb',
    'posts', 'Id',
    '{
        "enabled": true,
        "unit": "chars",
        "chunk_size": 4000,
        "overlap": 400,
        "min_chunk_size": 800
    }',
    '{
        "enabled": true,
        "provider": "openai",
        "api_base": "https://api.openai.com/v1",
        "api_key": "sk-your-api-key",
        "model": "text-embedding-3-small",
        "dim": 1536,
        "batch_size": 16,
        "timeout_ms": 20000
    }'
);
```

**Result:** Real embeddings are generated via the OpenAI API in batches of 16.

---

## Configuration Reference

### chunking_json

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `enabled` | boolean | `true` | Enable/disable chunking |
| `unit` | string | `"chars"` | Unit of measurement (only `"chars"` supported) |
| `chunk_size` | integer | `4000` | Target size of each chunk |
| `overlap` | integer | `400` | Overlap between consecutive chunks |
| `min_chunk_size` | integer | `800` | Minimum size to avoid tiny tail chunks |
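
As a worked example with these defaults, assuming a simple sliding window: each new chunk starts `chunk_size - overlap` = 3600 characters after the previous one, so a 9000-character document yields chunks at offsets 0, 3600, and 7200; the final 1800-character tail is kept as its own chunk because it exceeds `min_chunk_size`.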

### embedding_json

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `enabled` | boolean | `false` | Enable/disable embedding generation |
| `provider` | string | `"stub"` | `"stub"` or `"openai"` |
| `model` | string | `"unknown"` | Model name (for observability) |
| `dim` | integer | `1536` | Vector dimension |
| `api_base` | string | - | API base URL (for `provider="openai"`) |
| `api_key` | string | - | API authentication key |
| `batch_size` | integer | `16` | Maximum chunks per API call |
| `timeout_ms` | integer | `20000` | Request timeout in milliseconds |
| `input` | object | - | Embedding input template (optional) |

### embedding_json.input (Advanced)

Controls what text is embedded. Example:

```json
{
    "enabled": true,
    "provider": "openai",
    "dim": 1536,
    "input": {
        "concat": [
            {"col": "Title"},
            {"lit": "\nTags: "},
            {"col": "Tags"},
            {"lit": "\n\n"},
            {"chunk_body": true}
        ]
    }
}
```

**Result:** Embeds `{Title}\nTags: {Tags}\n\n{ChunkBody}`.

---

## Document Transformation (doc_map_json)

By default, all columns from the source table are available. To map columns to document fields:

```sql
INSERT INTO rag_sources (
    name, enabled, backend_type,
    backend_host, backend_port, backend_user, backend_pass, backend_db,
    table_name, pk_column, doc_map_json
)
VALUES (
    'mapped_source', 1, 'mysql',
    '127.0.0.1', 3306, 'root', 'pass', 'mydb',
    'posts', 'Id',
    '{
        "doc_id": {"format": "posts:{Id}"},
        "title": {"concat": [{"col": "Title"}]},
        "body": {"concat": [{"col": "Content"}]},
        "metadata": {"pick": ["Id", "Score", "Tags"]}
    }'
);
```

**Result:** Custom mapping from MySQL columns to document fields.
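
For instance, a `posts` row with `Id=123` would produce a document whose `doc_id` is `posts:123`, with `title` taken from the `Title` column, `body` from `Content`, and `metadata` restricted to the picked `Id`, `Score`, and `Tags` columns.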

---

## Filtering (where_sql)

Only ingest rows matching a WHERE clause:

```sql
UPDATE rag_sources
SET where_sql = 'Score >= 7 AND CreationDate >= ''2024-01-01'''
WHERE source_id = 1;
```
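
Conceptually, the clause is appended to the backend read, so the effective query would look like the following (illustrative; the actual SQL is assembled by the tool and visible with `--log-level=debug`):

```sql
SELECT * FROM posts
WHERE Score >= 7 AND CreationDate >= '2024-01-01';
```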

---

## Running Ingestion

### Single Run

```bash
./rag_ingest ingest -h 127.0.0.1 -P 6030 -u root -p root -D rag_index
```

### Incremental Runs (Watermark)

The tool tracks the last processed primary key value in `rag_sync_state`. Subsequent runs only fetch new rows.

```bash
# First run: ingests all rows
./rag_ingest ingest -h 127.0.0.1 -P 6030 -u root -p root -D rag_index

# Second run: only ingests new rows
./rag_ingest ingest -h 127.0.0.1 -P 6030 -u root -p root -D rag_index
```
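
The watermark lives in `rag_sync_state` (its columns appear in the Verification section below), so it can be inspected or reset directly:

```sql
-- Inspect the stored cursor for each source
SELECT source_id, mode, cursor_json FROM rag_sync_state;

-- Force a full re-scan of source 1 on the next run
DELETE FROM rag_sync_state WHERE source_id = 1;
```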

---

## Transaction Handling

### Per-Source Commits

Each data source is processed in its own transaction:

```text
Source 1: BEGIN IMMEDIATE → ingest data → COMMIT ✅
Source 2: BEGIN IMMEDIATE → ingest data → ROLLBACK ❌ (error occurred)
Source 3: BEGIN IMMEDIATE → ingest data → COMMIT ✅
```

**Benefits:**
- **Isolated failures**: If source 2 fails, sources 1 and 3 are still committed
- **Shorter locks**: Each table is only locked during its own ingestion
- **Better recovery**: Partial progress is preserved on failures
- **Lower memory**: Changes are flushed per source instead of held until the end

### Transaction Logging

```bash
./rag_ingest ingest -h 127.0.0.1 -P 6030 -u root -p root -D rag_index
# Output:
[INFO] Processing source 1 of 3
[DEBUG] Starting transaction for source 1...
[INFO] Committing source 1...

[INFO] Processing source 2 of 3
[DEBUG] Starting transaction for source 2...
[WARN] Rolling back source 2 due to errors

[INFO] Processing source 3 of 3
[DEBUG] Starting transaction for source 3...
[INFO] Committing source 3...

[INFO] === 'ingest' command complete ===
Succeeded: 2
Failed: 1
```

### Multiple Sources Example

```sql
-- Configure multiple sources
INSERT INTO rag_sources (name, enabled, backend_type, ...)
VALUES
    ('stack_overflow', 1, 'mysql', '127.0.0.1', 3306, ...),
    ('github_issues', 1, 'mysql', '127.0.0.1', 3306, ...),
    ('discussions', 1, 'mysql', '127.0.0.1', 3306, ...);
```

If `github_issues` fails (e.g., a connection timeout), the other two sources are still ingested successfully.

---

## Monitoring Progress

### Default Logging (INFO level)

```bash
./rag_ingest ingest -h 127.0.0.1 -P 6030 -u root -p root -D rag_index
# Output:
[2026-01-28 12:34:56] [INFO] === RAG Ingestion Tool Starting ===
[2026-01-28 12:34:56] [INFO] Loaded 1 enabled source(s)
[2026-01-28 12:34:57] [INFO] === Starting ingestion for source_id=1, name=my_source ===
[2026-01-28 12:34:58] [INFO] Backend query returned 10000 row(s) to process
[2026-01-28 12:35:00] [INFO] Progress: ingested_docs=1000, skipped_docs=50, chunks=4000
[2026-01-28 12:35:02] [INFO] Progress: ingested_docs=2000, skipped_docs=100, chunks=8000
[2026-01-28 12:35:10] [INFO] === Source ingestion complete: my_source ===
[2026-01-28 12:35:10] [INFO] ingested_docs=9850, skipped_docs=150, total_chunks=39400
[2026-01-28 12:35:10] [INFO] embedding_batches=2463
```

### Detailed Logging (DEBUG level)

```bash
./rag_ingest ingest --log-level=debug -h 127.0.0.1 -P 6030 -u root -p root -D rag_index
# Output includes:
# - Connection parameters
# - SQL queries executed
# - Configuration parsing (chunking, embeddings)
# - Per-document operations
# - Chunk counts per document
# - Embedding batch operations
# - Sync state updates
```

### Maximum Verbosity (TRACE level)

```bash
./rag_ingest ingest --log-level=trace -h 127.0.0.1 -P 6030 -u root -p root -D rag_index
# Output includes EVERYTHING:
# - Function entry/exit
# - Individual SQL statement execution
# - Per-chunk operations
# - Internal state changes
```

### Progress Indicators

| Interval | Trigger | Output |
|----------|---------|--------|
| Per-command | Start/end | `=== RAG Ingestion Tool Starting ===` |
| Per-source | Start/end | `=== Starting ingestion for source_id=X, name=Y ===` |
| Every 1000 docs | During processing | `Progress: ingested_docs=1000, skipped_docs=50, chunks=4000` |
| Per-batch | Embeddings | `Generating embeddings for batch of 16 chunks...` |
| End of source | Summary | `ingested_docs=9850, skipped_docs=150, total_chunks=39400` |

### Understanding the Output

- **ingested_docs**: New documents added to the index
- **skipped_docs**: Documents already in the index (not re-processed)
- **total_chunks**: Total chunks created across all ingested documents
- **embedding_batches**: Number of embedding API calls made (for embedding-enabled sources)
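
These figures are internally consistent: with the default `batch_size` of 16, the 39400 chunks in the sample run above need ceil(39400 / 16) = 2463 embedding batches, matching the reported `embedding_batches=2463`.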

---

## Verification

```bash
# Connect to the SQLite3 Server via MySQL protocol
mysql -h 127.0.0.1 -P 6030 -u root -proot rag_index -e "
-- Check counts
SELECT 'documents' AS type, COUNT(*) FROM rag_documents
UNION ALL
SELECT 'chunks', COUNT(*) FROM rag_chunks
UNION ALL
SELECT 'fts_entries', COUNT(*) FROM rag_fts_chunks
UNION ALL
SELECT 'vectors', COUNT(*) FROM rag_vec_chunks;

-- Check sync state
SELECT source_id, mode, cursor_json FROM rag_sync_state;
"
```
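
To spot-check vector search directly, a hypothetical vec0 probe is sketched below; the vector column name (`embedding`) and the exact query shape are assumptions, since the table is created by `init`:

```sql
-- Hypothetical KNN probe; a real query needs the full 1536-dimension
-- JSON array as the match argument, not this truncated placeholder
SELECT rowid, distance
FROM rag_vec_chunks
WHERE embedding MATCH '[0.1, 0.2, 0.3]'
ORDER BY distance
LIMIT 5;
```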

---

## Common Workflows

### Workflow 1: Initial Setup

```bash
# 1. Initialize database
./rag_ingest init -h 127.0.0.1 -P 6030 -u root -p root -D rag

# 2. Add source
mysql -h 127.0.0.1 -P 6030 -u root -proot rag -e "
INSERT INTO rag_sources (name, enabled, backend_type,
    backend_host, backend_port, backend_user, backend_pass, backend_db,
    table_name, pk_column, chunking_json)
VALUES ('my_data', 1, 'mysql',
    'localhost', 3306, 'root', 'pass', 'mydb',
    'posts', 'Id',
    '{\"enabled\":true,\"chunk_size\":4000,\"overlap\":400}');
"

# 3. Ingest
./rag_ingest ingest -h 127.0.0.1 -P 6030 -u root -p root -D rag
```

### Workflow 2: Re-run with New Configuration

```bash
# 1. Update source configuration
mysql -h 127.0.0.1 -P 6030 -u root -proot rag -e "
UPDATE rag_sources
SET chunking_json='{\"enabled\":true,\"chunk_size\":2000}'
WHERE source_id=1;
"

# 2. Clear existing data (optional - to re-chunk with the new settings)
mysql -h 127.0.0.1 -P 6030 -u root -proot rag -e "
DELETE FROM rag_vec_chunks WHERE source_id = 1;
DELETE FROM rag_sync_state WHERE source_id = 1;
DELETE FROM rag_chunks WHERE source_id = 1;
DELETE FROM rag_documents WHERE source_id = 1;
"

# 3. Re-ingest
./rag_ingest ingest -h 127.0.0.1 -P 6030 -u root -p root -D rag
```

### Workflow 3: Add Embeddings to Existing Data

```bash
# 1. Enable embeddings on the existing source
mysql -h 127.0.0.1 -P 6030 -u root -proot rag -e "
UPDATE rag_sources
SET embedding_json='{\"enabled\":true,\"provider\":\"stub\",\"dim\":1536}'
WHERE source_id=1;
"

# 2. Clear sync state (so all rows are re-processed)
mysql -h 127.0.0.1 -P 6030 -u root -proot rag -e "
DELETE FROM rag_sync_state WHERE source_id=1;
"

# 3. Clear vectors only (keep documents and chunks)
mysql -h 127.0.0.1 -P 6030 -u root -proot rag -e "
DELETE FROM rag_vec_chunks WHERE source_id=1;
"

# 4. Re-ingest (skips existing documents, but generates embeddings)
./rag_ingest ingest -h 127.0.0.1 -P 6030 -u root -p root -D rag
```

**Note:** v0 skips documents that already exist. To regenerate embeddings from scratch, clear `rag_documents` or use a `WHERE` clause.

---

## Troubleshooting

### "MySQL connect failed"

- Verify the SQLite3 Server is running on port 6030
- Check that the credentials are correct
- Ensure the database exists

### "MySQL query failed" (backend)

- Verify the backend MySQL credentials in `rag_sources`
- Check that the backend MySQL server is running (default: 127.0.0.1:3306)
- Verify that the table and column names exist

### "No enabled sources found"

- Run `SELECT * FROM rag_sources WHERE enabled = 1;` via MySQL protocol
- Ensure `enabled = 1` for your source

### "Failed to generate embeddings"

- Check the `embedding_json` configuration
- For `provider="openai"`: verify `api_base`, `api_key`, `model`
- Check network connectivity to the embedding service
- Increase `timeout_ms` if needed

### Too much / too little output

- Use `--log-level=error` for production scripts (minimal output)
- Use `--log-level=info` for normal operation (default)
- Use `--log-level=debug` to see SQL queries and configuration values
- Use `--log-level=trace` for development and deep troubleshooting

### Debugging SQL queries

```bash
# Use --log-level=debug to see all SQL queries being executed
./rag_ingest ingest --log-level=debug -h 127.0.0.1 -P 6030 -u root -p root -D rag_db
# Output will include:
# - SELECT queries to rag_sources, rag_sync_state, rag_documents
# - INSERT statements for documents, chunks, FTS entries
# - The backend SELECT query being built and executed
```

### Checking configuration values

```bash
# Use --log-level=debug to see the parsed configuration
./rag_ingest ingest --log-level=debug -h 127.0.0.1 -P 6030 -u root -p root -D rag_db
# Output includes:
# - Chunking config: enabled=yes, chunk_size=4000, overlap=400
# - Embedding config: enabled=yes, provider=openai, model=text-embedding-3-small
# - Watermark/resync values
```

### Performance issues

- Use `--log-level=debug` to see embedding batch operations
- Check the `embedding_batches` count in the final summary
- Reduce `batch_size` in `embedding_json` if API timeouts occur
- Increase `timeout_ms` for slower embedding services

---

## Architecture Notes

### MySQL Protocol Gateway

This version uses ProxySQL's SQLite3 Server as a gateway:

```
rag_ingest --[MySQL Protocol]--> ProxySQL SQLite3 Server (port 6030) --> SQLite Backend
                                                                             |
                                                                             +-- FTS5 Full-Text Search
                                                                             +-- vec0 Vector Similarity
                                                                             +-- Standard SQL Queries
```

**Benefits:**
- No local SQLite file dependencies
- Centralized RAG index database
- Concurrent access from multiple clients
- The same schema and queries work transparently

### Backend Data Source

The tool connects to a separate MySQL/MariaDB server to fetch the source data:

```
rag_ingest --[MySQL Protocol]--> Backend MySQL (port 3306) --> Source Tables
```

This is configured per source via the `rag_sources` table (`backend_host`, `backend_port`, etc.).