You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
proxysql/doc/posts-embeddings-setup.md

9.9 KiB

Posts Table Embeddings Setup Guide

This guide explains how to set up and populate virtual tables for storing and searching embeddings of the Posts table content using sqlite-rembed and sqlite-vec extensions in ProxySQL.

Prerequisites

  1. ProxySQL running with SQLite3 backend enabled (--sqlite3-server flag)
  2. Posts table copied from MySQL to SQLite3 server (248,905 rows)
    • Use scripts/copy_stackexchange_Posts_mysql_to_sqlite3.py if not already copied
  3. Valid API credentials for embedding generation
  4. Network access to embedding API endpoint

Setup Steps

Step 1: Create Virtual Vector Table

Create a virtual table for storing 768-dimensional embeddings (matching nomic-embed-text-v1.5 model output):

-- Create virtual vector table for Posts embeddings
CREATE VIRTUAL TABLE Posts_embeddings USING vec0(
    embedding float[768]
);

Step 2: Configure API Client

Configure an embedding API client using the temp.rembed_clients virtual table:

-- Configure embedding API client
-- Replace YOUR_API_KEY with actual API key
INSERT INTO temp.rembed_clients(name, options) VALUES
  ('posts-embed-client',
   rembed_client_options(
     'format', 'openai',
     'url', 'https://api.synthetic.new/openai/v1/embeddings',
     'key', 'YOUR_API_KEY',
     'model', 'hf:nomic-ai/nomic-embed-text-v1.5'
   )
  );

Step 3: Generate and Insert Embeddings

For Testing (First 100 rows)

-- Generate embeddings for first 100 Posts
INSERT OR REPLACE INTO Posts_embeddings(rowid, embedding)
SELECT rowid, rembed('posts-embed-client',
       COALESCE(Title || ' ', '') || Body) as embedding
FROM Posts
LIMIT 100;

For Full Table (Batch Processing)

Use this optimized batch query that processes unembedded rows without requiring rowid tracking:

-- Batch process unembedded rows (processes ~1000 rows at a time)
INSERT OR REPLACE INTO Posts_embeddings(rowid, embedding)
SELECT Posts.rowid, rembed('posts-embed-client',
       COALESCE(Posts.Title || ' ', '') || Posts.Body) as embedding
FROM Posts
LEFT JOIN Posts_embeddings ON Posts.rowid = Posts_embeddings.rowid
WHERE Posts_embeddings.rowid IS NULL
LIMIT 1000;

Key features of this batch query:

  • Uses LEFT JOIN to find Posts without existing embeddings
  • WHERE Posts_embeddings.rowid IS NULL filters for unprocessed rows
  • LIMIT 1000 controls batch size
  • Can be run repeatedly until all rows are processed
  • No need to track which rowids have been processed

Step 4: Verify Embeddings

-- Check total embeddings count
SELECT COUNT(*) as total_embeddings FROM Posts_embeddings;

-- Check embedding size (should be 3072 bytes: 768 dimensions × 4 bytes)
SELECT rowid, length(embedding) as embedding_size_bytes
FROM Posts_embeddings LIMIT 3;

-- Check percentage of Posts with embeddings
SELECT
  (SELECT COUNT(*) FROM Posts_embeddings) as with_embeddings,
  (SELECT COUNT(*) FROM Posts) as total_posts,
  ROUND(
    (SELECT COUNT(*) FROM Posts_embeddings) * 100.0 /
    (SELECT COUNT(*) FROM Posts), 2
  ) as percentage_complete;

Batch Processing Strategy for 248,905 Rows

  1. Run the batch query repeatedly until all rows have embeddings
  2. Add delays between batches to avoid API rate limiting
  3. Monitor progress using the verification queries above

Example Shell Script for Batch Processing

#!/bin/bash
# process_posts_embeddings.sh

PROXYSQL_HOST="127.0.0.1"
PROXYSQL_PORT="6030"
MYSQL_USER="root"
MYSQL_PASS="root"
BATCH_SIZE=1000
DELAY_SECONDS=5

echo "Starting Posts embeddings generation..."

while true; do
  # Execute batch query
  mysql -h "$PROXYSQL_HOST" -P "$PROXYSQL_PORT" -u "$MYSQL_USER" -p"$MYSQL_PASS" << EOF
    INSERT OR REPLACE INTO Posts_embeddings(rowid, embedding)
    SELECT Posts.rowid, rembed('posts-embed-client',
           COALESCE(Posts.Title || ' ', '') || Posts.Body) as embedding
    FROM Posts
    LEFT JOIN Posts_embeddings ON Posts.rowid = Posts_embeddings.rowid
    WHERE Posts_embeddings.rowid IS NULL
    LIMIT $BATCH_SIZE;
EOF

  # Check if any rows were processed
  PROCESSED=$(mysql -h "$PROXYSQL_HOST" -P "$PROXYSQL_PORT" -u "$MYSQL_USER" -p"$MYSQL_PASS" -s -N << EOF
    SELECT COUNT(*) FROM Posts_embeddings;
EOF)

  TOTAL=$(mysql -h "$PROXYSQL_HOST" -P "$PROXYSQL_PORT" -u "$MYSQL_USER" -p"$MYSQL_PASS" -s -N << EOF
    SELECT COUNT(*) FROM Posts;
EOF)

  PERCENTAGE=$(echo "scale=2; $PROCESSED * 100 / $TOTAL" | bc)
  echo "Processed: $PROCESSED/$TOTAL rows ($PERCENTAGE%)"

  # Break if all rows processed
  if [ "$PROCESSED" -eq "$TOTAL" ]; then
    echo "All rows processed!"
    break
  fi

  # Wait before next batch
  echo "Waiting $DELAY_SECONDS seconds before next batch..."
  sleep $DELAY_SECONDS
done

Similarity Search Examples

Once embeddings are generated, you can perform semantic search:

Example 1: Find Similar Posts

-- Find Posts similar to a query about databases
SELECT p.SiteId, p.Id as PostId, p.Title, e.distance,
       substr(p.Body, 1, 100) as body_preview
FROM (
  SELECT rowid, distance
  FROM Posts_embeddings
  WHERE embedding MATCH rembed('posts-embed-client',
         'database systems and SQL queries')
  LIMIT 5
) e
JOIN Posts p ON e.rowid = p.rowid
ORDER BY e.distance;

Example 2: Find Posts Similar to Specific Post

-- Find Posts similar to Post with ID 1
SELECT p2.SiteId, p2.Id as PostId, p2.Title, e.distance,
       substr(p2.Body, 1, 100) as body_preview
FROM (
  SELECT rowid, distance
  FROM Posts_embeddings
  WHERE embedding MATCH (
    SELECT embedding
    FROM Posts_embeddings
    WHERE rowid = 1  -- Change to target Post rowid
  )
  AND rowid != 1
  LIMIT 5
) e
JOIN Posts p2 ON e.rowid = p2.rowid
ORDER BY e.distance;

Example 3: Find Posts About "What is ProxySQL?" with Correct LIMIT Syntax

When using sqlite-vec's MATCH operator for similarity search, you must include a LIMIT clause (or k = ? constraint) in the same query level as the MATCH. This tells the extension how many nearest neighbors to return.

Common error: ERROR 1045 (28000): A LIMIT or 'k = ?' constraint is required on vec0 knn queries.

Correct query:

-- Find Posts about "What is ProxySQL?" using semantic similarity
SELECT
  p.Id,
  p.Title,
  SUBSTR(p.Body, 1, 200) AS Excerpt,
  e.distance
FROM (
  -- LIMIT must be in the subquery that contains MATCH
  SELECT rowid, distance
  FROM Posts_embeddings
  WHERE embedding MATCH rembed('posts-embed-client', 'What is ProxySQL?')
  ORDER BY distance ASC
  LIMIT 10          -- REQUIRED for vec0 KNN queries
) e
JOIN Posts p ON e.rowid = p.rowid
ORDER BY e.distance ASC;

Alternative using k = ? constraint (instead of LIMIT):

SELECT p.Id, p.Title, e.distance
FROM (
  SELECT rowid, distance
  FROM Posts_embeddings
  WHERE embedding MATCH rembed('posts-embed-client', 'What is ProxySQL?')
    AND k = 10      -- Alternative to LIMIT constraint
  ORDER BY distance ASC
) e
JOIN Posts p ON e.rowid = p.rowid
ORDER BY e.distance ASC;

Key rules:

  1. LIMIT or k = ? must be in the same query level as MATCH
  2. Cannot use both LIMIT and k = ? together choose one
  3. When joining, put MATCH + LIMIT in a subquery
  4. The constraint tells sqlite-vec how many similar vectors to return

Performance Considerations

  1. API Rate Limiting: The rembed() function makes HTTP requests to the API

    • Batch size of 1000 with 5-second delays is conservative
    • Adjust based on API rate limits
    • Monitor API usage and costs
  2. Embedding Storage:

    • Each embedding: 768 dimensions × 4 bytes = 3,072 bytes
    • Full table (248,905 rows): ~765 MB
    • Ensure sufficient disk space
  3. Search Performance:

    • vec0 virtual tables use approximate nearest neighbor search
    • Performance scales with number of vectors and dimensions
    • Use LIMIT clauses to control result size

Troubleshooting

Common Issues

  1. API Connection Errors

    • Verify API key is valid and has quota
    • Check network connectivity to API endpoint
    • Confirm API endpoint URL is correct
  2. Embedding Generation Failures

    • Check temp.rembed_clients configuration
    • Verify client name matches in rembed() calls
    • Test with simple text first: SELECT rembed('posts-embed-client', 'test');
  3. Batch Processing Stalls

    • Check if API rate limits are being hit
    • Increase delay between batches
    • Reduce batch size
  4. Memory Issues

    • Large batches may consume significant memory
    • Reduce batch size if encountering memory errors
    • Monitor ProxySQL memory usage

Verification Queries

-- Check API client configuration
SELECT name, json_extract(options, '$.format') as format,
       json_extract(options, '$.model') as model
FROM temp.rembed_clients;

-- Test embedding generation
SELECT length(rembed('posts-embed-client', 'test text')) as test_embedding_size;

-- Check for embedding generation errors
SELECT rowid FROM Posts_embeddings WHERE length(embedding) != 3072;

Maintenance

Adding New Posts

When new Posts are added to the table:

-- Generate embeddings for new Posts
INSERT OR REPLACE INTO Posts_embeddings(rowid, embedding)
SELECT Posts.rowid, rembed('posts-embed-client',
       COALESCE(Posts.Title || ' ', '') || Posts.Body) as embedding
FROM Posts
LEFT JOIN Posts_embeddings ON Posts.rowid = Posts_embeddings.rowid
WHERE Posts_embeddings.rowid IS NULL;

Recreating Virtual Table

If you need to recreate the virtual table:

-- Drop existing table
DROP TABLE IF EXISTS Posts_embeddings;

-- Recreate with same schema
CREATE VIRTUAL TABLE Posts_embeddings USING vec0(
    embedding float[768]
);
  1. sqlite-rembed Integration Documentation
  2. SQLite3 Server Documentation
  3. Vector Search Testing
  4. Copy Script

Last Updated: $(date)