9.9 KiB
Posts Table Embeddings Setup Guide
This guide explains how to set up and populate virtual tables for storing and searching embeddings of the Posts table content using sqlite-rembed and sqlite-vec extensions in ProxySQL.
Prerequisites
- ProxySQL running with SQLite3 backend enabled (
--sqlite3-serverflag) - Posts table copied from MySQL to SQLite3 server (248,905 rows)
- Use
scripts/copy_stackexchange_Posts_mysql_to_sqlite3.pyif not already copied
- Use
- Valid API credentials for embedding generation
- Network access to embedding API endpoint
Setup Steps
Step 1: Create Virtual Vector Table
Create a virtual table for storing 768-dimensional embeddings (matching nomic-embed-text-v1.5 model output):
-- Create virtual vector table for Posts embeddings
CREATE VIRTUAL TABLE Posts_embeddings USING vec0(
embedding float[768]
);
Step 2: Configure API Client
Configure an embedding API client using the temp.rembed_clients virtual table:
-- Configure embedding API client
-- Replace YOUR_API_KEY with actual API key
INSERT INTO temp.rembed_clients(name, options) VALUES
('posts-embed-client',
rembed_client_options(
'format', 'openai',
'url', 'https://api.synthetic.new/openai/v1/embeddings',
'key', 'YOUR_API_KEY',
'model', 'hf:nomic-ai/nomic-embed-text-v1.5'
)
);
Step 3: Generate and Insert Embeddings
For Testing (First 100 rows)
-- Generate embeddings for first 100 Posts
INSERT OR REPLACE INTO Posts_embeddings(rowid, embedding)
SELECT rowid, rembed('posts-embed-client',
COALESCE(Title || ' ', '') || Body) as embedding
FROM Posts
LIMIT 100;
For Full Table (Batch Processing)
Use this optimized batch query that processes unembedded rows without requiring rowid tracking:
-- Batch process unembedded rows (processes ~1000 rows at a time)
INSERT OR REPLACE INTO Posts_embeddings(rowid, embedding)
SELECT Posts.rowid, rembed('posts-embed-client',
COALESCE(Posts.Title || ' ', '') || Posts.Body) as embedding
FROM Posts
LEFT JOIN Posts_embeddings ON Posts.rowid = Posts_embeddings.rowid
WHERE Posts_embeddings.rowid IS NULL
LIMIT 1000;
Key features of this batch query:
- Uses
LEFT JOINto find Posts without existing embeddings WHERE Posts_embeddings.rowid IS NULLfilters for unprocessed rowsLIMIT 1000controls batch size- Can be run repeatedly until all rows are processed
- No need to track which rowids have been processed
Step 4: Verify Embeddings
-- Check total embeddings count
SELECT COUNT(*) as total_embeddings FROM Posts_embeddings;
-- Check embedding size (should be 3072 bytes: 768 dimensions × 4 bytes)
SELECT rowid, length(embedding) as embedding_size_bytes
FROM Posts_embeddings LIMIT 3;
-- Check percentage of Posts with embeddings
SELECT
(SELECT COUNT(*) FROM Posts_embeddings) as with_embeddings,
(SELECT COUNT(*) FROM Posts) as total_posts,
ROUND(
(SELECT COUNT(*) FROM Posts_embeddings) * 100.0 /
(SELECT COUNT(*) FROM Posts), 2
) as percentage_complete;
Batch Processing Strategy for 248,905 Rows
Recommended Approach
- Run the batch query repeatedly until all rows have embeddings
- Add delays between batches to avoid API rate limiting
- Monitor progress using the verification queries above
Example Shell Script for Batch Processing
#!/bin/bash
# process_posts_embeddings.sh
PROXYSQL_HOST="127.0.0.1"
PROXYSQL_PORT="6030"
MYSQL_USER="root"
MYSQL_PASS="root"
BATCH_SIZE=1000
DELAY_SECONDS=5
echo "Starting Posts embeddings generation..."
while true; do
# Execute batch query
mysql -h "$PROXYSQL_HOST" -P "$PROXYSQL_PORT" -u "$MYSQL_USER" -p"$MYSQL_PASS" << EOF
INSERT OR REPLACE INTO Posts_embeddings(rowid, embedding)
SELECT Posts.rowid, rembed('posts-embed-client',
COALESCE(Posts.Title || ' ', '') || Posts.Body) as embedding
FROM Posts
LEFT JOIN Posts_embeddings ON Posts.rowid = Posts_embeddings.rowid
WHERE Posts_embeddings.rowid IS NULL
LIMIT $BATCH_SIZE;
EOF
# Check if any rows were processed
PROCESSED=$(mysql -h "$PROXYSQL_HOST" -P "$PROXYSQL_PORT" -u "$MYSQL_USER" -p"$MYSQL_PASS" -s -N << EOF
SELECT COUNT(*) FROM Posts_embeddings;
EOF)
TOTAL=$(mysql -h "$PROXYSQL_HOST" -P "$PROXYSQL_PORT" -u "$MYSQL_USER" -p"$MYSQL_PASS" -s -N << EOF
SELECT COUNT(*) FROM Posts;
EOF)
PERCENTAGE=$(echo "scale=2; $PROCESSED * 100 / $TOTAL" | bc)
echo "Processed: $PROCESSED/$TOTAL rows ($PERCENTAGE%)"
# Break if all rows processed
if [ "$PROCESSED" -eq "$TOTAL" ]; then
echo "All rows processed!"
break
fi
# Wait before next batch
echo "Waiting $DELAY_SECONDS seconds before next batch..."
sleep $DELAY_SECONDS
done
Similarity Search Examples
Once embeddings are generated, you can perform semantic search:
Example 1: Find Similar Posts
-- Find Posts similar to a query about databases
SELECT p.SiteId, p.Id as PostId, p.Title, e.distance,
substr(p.Body, 1, 100) as body_preview
FROM (
SELECT rowid, distance
FROM Posts_embeddings
WHERE embedding MATCH rembed('posts-embed-client',
'database systems and SQL queries')
LIMIT 5
) e
JOIN Posts p ON e.rowid = p.rowid
ORDER BY e.distance;
Example 2: Find Posts Similar to Specific Post
-- Find Posts similar to Post with ID 1
SELECT p2.SiteId, p2.Id as PostId, p2.Title, e.distance,
substr(p2.Body, 1, 100) as body_preview
FROM (
SELECT rowid, distance
FROM Posts_embeddings
WHERE embedding MATCH (
SELECT embedding
FROM Posts_embeddings
WHERE rowid = 1 -- Change to target Post rowid
)
AND rowid != 1
LIMIT 5
) e
JOIN Posts p2 ON e.rowid = p2.rowid
ORDER BY e.distance;
Example 3: Find Posts About "What is ProxySQL?" with Correct LIMIT Syntax
When using sqlite-vec's MATCH operator for similarity search, you must include a LIMIT clause (or k = ? constraint) in the same query level as the MATCH. This tells the extension how many nearest neighbors to return.
Common error: ERROR 1045 (28000): A LIMIT or 'k = ?' constraint is required on vec0 knn queries.
Correct query:
-- Find Posts about "What is ProxySQL?" using semantic similarity
SELECT
p.Id,
p.Title,
SUBSTR(p.Body, 1, 200) AS Excerpt,
e.distance
FROM (
-- LIMIT must be in the subquery that contains MATCH
SELECT rowid, distance
FROM Posts_embeddings
WHERE embedding MATCH rembed('posts-embed-client', 'What is ProxySQL?')
ORDER BY distance ASC
LIMIT 10 -- REQUIRED for vec0 KNN queries
) e
JOIN Posts p ON e.rowid = p.rowid
ORDER BY e.distance ASC;
Alternative using k = ? constraint (instead of LIMIT):
SELECT p.Id, p.Title, e.distance
FROM (
SELECT rowid, distance
FROM Posts_embeddings
WHERE embedding MATCH rembed('posts-embed-client', 'What is ProxySQL?')
AND k = 10 -- Alternative to LIMIT constraint
ORDER BY distance ASC
) e
JOIN Posts p ON e.rowid = p.rowid
ORDER BY e.distance ASC;
Key rules:
LIMITork = ?must be in the same query level asMATCH- Cannot use both
LIMITandk = ?together – choose one - When joining, put
MATCH+LIMITin a subquery - The constraint tells
sqlite-vechow many similar vectors to return
Performance Considerations
-
API Rate Limiting: The
rembed()function makes HTTP requests to the API- Batch size of 1000 with 5-second delays is conservative
- Adjust based on API rate limits
- Monitor API usage and costs
-
Embedding Storage:
- Each embedding: 768 dimensions × 4 bytes = 3,072 bytes
- Full table (248,905 rows): ~765 MB
- Ensure sufficient disk space
-
Search Performance:
vec0virtual tables use approximate nearest neighbor search- Performance scales with number of vectors and dimensions
- Use
LIMITclauses to control result size
Troubleshooting
Common Issues
-
API Connection Errors
- Verify API key is valid and has quota
- Check network connectivity to API endpoint
- Confirm API endpoint URL is correct
-
Embedding Generation Failures
- Check
temp.rembed_clientsconfiguration - Verify client name matches in
rembed()calls - Test with simple text first:
SELECT rembed('posts-embed-client', 'test');
- Check
-
Batch Processing Stalls
- Check if API rate limits are being hit
- Increase delay between batches
- Reduce batch size
-
Memory Issues
- Large batches may consume significant memory
- Reduce batch size if encountering memory errors
- Monitor ProxySQL memory usage
Verification Queries
-- Check API client configuration
SELECT name, json_extract(options, '$.format') as format,
json_extract(options, '$.model') as model
FROM temp.rembed_clients;
-- Test embedding generation
SELECT length(rembed('posts-embed-client', 'test text')) as test_embedding_size;
-- Check for embedding generation errors
SELECT rowid FROM Posts_embeddings WHERE length(embedding) != 3072;
Maintenance
Adding New Posts
When new Posts are added to the table:
-- Generate embeddings for new Posts
INSERT OR REPLACE INTO Posts_embeddings(rowid, embedding)
SELECT Posts.rowid, rembed('posts-embed-client',
COALESCE(Posts.Title || ' ', '') || Posts.Body) as embedding
FROM Posts
LEFT JOIN Posts_embeddings ON Posts.rowid = Posts_embeddings.rowid
WHERE Posts_embeddings.rowid IS NULL;
Recreating Virtual Table
If you need to recreate the virtual table:
-- Drop existing table
DROP TABLE IF EXISTS Posts_embeddings;
-- Recreate with same schema
CREATE VIRTUAL TABLE Posts_embeddings USING vec0(
embedding float[768]
);
Related Resources
- sqlite-rembed Integration Documentation
- SQLite3 Server Documentation
- Vector Search Testing
- Copy Script
Last Updated: $(date)