
Embedding Testing Plan (MySQL Protocol Version)

Prerequisites

  1. ProxySQL SQLite3 Server running on port 6030
  2. MySQL server (backend data source) running with test database
  3. OpenAI-compatible embedding service accessible
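To confirm prerequisites 1 and 2 before running anything, a quick port probe works (this uses bash's /dev/tcp; the ports are the defaults assumed throughout this plan — the embedding service from prerequisite 3 is a URL and is checked separately once configured):

```shell
# Probe the two local prerequisites: ProxySQL SQLite3 Server and backend MySQL.
for hp in 127.0.0.1:6030 127.0.0.1:3306; do
  host=${hp%:*} port=${hp#*:}
  # Attempt a 1-second TCP connect; report reachable/unreachable per service.
  if timeout 1 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "$hp reachable"
  else
    echo "$hp NOT reachable"
  fi
done
```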

Quick Start

# From repository root
cd RAG_POC

# Step 1: Set your embedding service credentials
export OPENAI_API_BASE="https://your-embedding-service.com/v1"
export OPENAI_API_KEY="your-api-key-here"
export OPENAI_MODEL="your-model-name"
export OPENAI_EMBEDDING_DIM=1536  # Adjust based on your model

# Step 2: Run the test
./test_rag_ingest_sqlite_server.sh

Configuration Options

OpenAI API

export OPENAI_API_BASE="https://api.openai.com/v1"
export OPENAI_API_KEY="sk-your-openai-key"
export OPENAI_MODEL="text-embedding-3-small"
export OPENAI_EMBEDDING_DIM=1536

Azure OpenAI

export OPENAI_API_BASE="https://your-resource.openai.azure.com/openai/deployments/your-deployment"
export OPENAI_API_KEY="your-azure-key"
export OPENAI_MODEL="text-embedding-ada-002"  # Your deployment name
export OPENAI_EMBEDDING_DIM=1536

Other OpenAI-compatible services

# Any service with OpenAI-compatible API
export OPENAI_API_BASE="https://your-service.com/v1"
export OPENAI_API_KEY="your-key"
export OPENAI_MODEL="model-name"
export OPENAI_EMBEDDING_DIM=768  # Match your model's dimension, e.g., 768, 1536, 3072
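Whichever service you use, it is worth confirming that OPENAI_EMBEDDING_DIM matches what the service actually returns. A sketch of that check, with the real curl call commented out (it needs network access and credentials) and a hypothetical 3-dimensional sample response standing in for it:

```shell
# Real call (uncomment to probe your service):
# RESPONSE=$(curl -s "$OPENAI_API_BASE/embeddings" \
#   -H "Authorization: Bearer $OPENAI_API_KEY" \
#   -H "Content-Type: application/json" \
#   -d "{\"model\":\"$OPENAI_MODEL\",\"input\":[\"ping\"]}")

# Sample OpenAI-style response, used here for illustration only.
RESPONSE='{"data":[{"embedding":[0.1,0.2,0.3]}]}'

# Count the returned vector's dimensions; compare against OPENAI_EMBEDDING_DIM.
DIM=$(printf '%s' "$RESPONSE" | python3 -c 'import json,sys; print(len(json.load(sys.stdin)["data"][0]["embedding"]))')
echo "service dim: $DIM"
```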

What the Test Does

Phase 4 (Embeddings with stub provider):

  1. Initializes RAG database schema via MySQL protocol
  2. Configures stub embedding provider
  3. Ingests 10 documents from MySQL backend
  4. Generates pseudo-embeddings instantly
  5. Verifies:
    • 10 documents created
    • 10 chunks created
    • 10 embeddings created
    • Vector self-match works (search finds itself)

To test with real embeddings, set the environment variables above and modify the test script to use "provider":"openai" instead of "provider":"stub".
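The provider switch can also be scripted with sed. This is illustrative only: the demo file below stands in for the embedding_json line in test_rag_ingest_sqlite_server.sh, whose exact layout may differ:

```shell
# Create a stand-in for the script's embedding_json line.
demo=$(mktemp)
printf '%s\n' '"embedding_json": {"enabled":true,"provider":"stub"}' > "$demo"

# Switch the stub provider to the real OpenAI-compatible one.
sed -i 's/"provider":"stub"/"provider":"openai"/' "$demo"
cat "$demo"
```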


Expected Output

========================================
Phase 4: Enable Embeddings (Stub)
========================================
Done source mysql_posts ingested_docs=10 skipped_docs=0
OK: rag_vec_chunks (embeddings enabled) = 10
OK: Vector self-match = posts:1#0
OK: Vector embeddings and search working

Verification Queries

After the test, manually verify via MySQL protocol:

mysql -h 127.0.0.1 -P 6030 -u root -proot test_rag -e "
-- All chunks have embeddings?
SELECT 'Missing embeddings: ' || COUNT(*) FROM rag_chunks c
LEFT JOIN rag_vec_chunks v ON c.chunk_id = v.chunk_id
WHERE v.chunk_id IS NULL;
-- Expected: 0

-- Sample embeddings
SELECT chunk_id, length(embedding) as embedding_bytes
FROM rag_vec_chunks LIMIT 5;
-- Expected: 6144 bytes per embedding (1536 floats * 4 bytes)

-- Vector similarity test
SELECT chunk_id, distance
FROM rag_vec_chunks
WHERE embedding MATCH (
    SELECT embedding FROM rag_vec_chunks WHERE chunk_id='posts:1#0' LIMIT 1
)
ORDER BY distance LIMIT 3;
"

Architecture

┌─────────────┐                    ┌──────────────────┐                    ┌─────────────┐
│ rag_ingest  │───MySQL Protocol──→│ ProxySQL         │─────FTS5/vec0─────→│   SQLite    │
│             │    (port 6030)     │ SQLite3 Server   │                    │   Backend   │
└──────┬──────┘                    └──────────────────┘                    └─────────────┘
       │
       │ MySQL Protocol
       ↓
┌──────────────────┐
│ Backend MySQL    │
│ (port 3306)      │
│                  │
│  • Source tables │
└──────────────────┘

Data flow:

  1. rag_ingest connects to SQLite3 Server (port 6030) via MySQL protocol
  2. Stores RAG index (documents, chunks, FTS, vectors) in SQLite backend
  3. Fetches source data from separate MySQL backend (port 3306)
  4. Generates embeddings via HTTP API (OpenAI-compatible)

Troubleshooting

Error: "MySQL connect failed" (SQLite3 Server)

  • Verify ProxySQL is running: ps aux | grep proxysql
  • Check port 6030 is listening: netstat -an | grep 6030
  • Verify credentials: mysql -h 127.0.0.1 -P 6030 -u root -proot

Error: "MySQL query failed" (backend MySQL)

  • Verify backend MySQL is running: mysql -h 127.0.0.1 -P 3306 -u root -proot
  • Check rag_sources configuration: SELECT * FROM rag_sources WHERE enabled=1;

Error: "Failed to generate embeddings"

  • Check OPENAI_API_BASE is correct
  • Check OPENAI_API_KEY is valid
  • Check OPENAI_MODEL exists in your service
  • Check network connectivity to embedding service

Error: "Dimension mismatch" (vec0)

  • Set OPENAI_EMBEDDING_DIM to match your model
  • Common dimensions: 768, 1536, 3072
  • Verify schema has correct vector dimension: SELECT sql FROM sqlite_master WHERE name='rag_vec_chunks';

Timeout errors

  • The default timeout is 30 seconds (configurable in embedding_json)
  • Check network connectivity to embedding service
  • Reduce batch_size if needed

Testing Different Batch Sizes

To test the batching implementation, modify the test script temporarily:

# Edit test_rag_ingest_sqlite_server.sh, find the embedding_json configuration
# Change from:
#   "provider":"stub"
# To:
#   "provider":"openai",
#   "batch_size": 32

Then observe the number of API calls in your embedding service dashboard.
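The number of API calls to expect is just ceiling division of chunk count by batch size; for the 10-chunk test above:

```shell
# Expected embedding API calls = ceil(chunks / batch_size).
chunks=10
batch_size=32
calls=$(( (chunks + batch_size - 1) / batch_size ))   # integer ceiling division
echo "calls=$calls"
```

With 10 chunks and batch_size 32, everything fits in a single call; dropping batch_size to 4 would produce 3 calls.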


Manual Testing

For interactive testing:

# 1. Initialize database
./rag_ingest init -h 127.0.0.1 -P 6030 -u root -p root -D test_embeddings

# 2. Configure source with embeddings
mysql -h 127.0.0.1 -P 6030 -u root -proot test_embeddings -e "
UPDATE rag_sources
SET embedding_json = '{
    \"enabled\": true,
    \"provider\": \"openai\",
    \"api_base\": \"https://api.openai.com/v1\",
    \"api_key\": \"sk-your-key\",
    \"model\": \"text-embedding-3-small\",
    \"dim\": 1536,
    \"batch_size\": 16
}'
WHERE source_id = 1;
"

# 3. Run ingestion
./rag_ingest ingest -h 127.0.0.1 -P 6030 -u root -p root -D test_embeddings

# 4. Verify
mysql -h 127.0.0.1 -P 6030 -u root -proot test_embeddings -e "
SELECT COUNT(*) as embeddings FROM rag_vec_chunks;
"

After embeddings are generated:

mysql -h 127.0.0.1 -P 6030 -u root -proot test_embeddings -e "
-- Find similar chunks to posts:1#0
SELECT
    c.chunk_id,
    substr(c.body, 1, 60) as content,
    v.distance
FROM rag_vec_chunks v
JOIN rag_chunks c ON c.chunk_id = v.chunk_id
WHERE v.embedding MATCH (
    SELECT embedding FROM rag_vec_chunks WHERE chunk_id='posts:1#0' LIMIT 1
)
ORDER BY v.distance
LIMIT 5;
"

Expected output:

  • posts:1#0 with distance 0.0 (exact match)
  • Other chunks with increasing distances