You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
proxysql/doc/posts-embeddings-setup.md

343 lines
9.9 KiB

This file contains ambiguous Unicode characters!

This file contains ambiguous Unicode characters that may be confused with others in your current locale. If your use case is intentional and legitimate, you can safely ignore this warning. Use the Escape button to highlight these characters.

# Posts Table Embeddings Setup Guide
This guide explains how to set up and populate virtual tables for storing and searching embeddings of the Posts table content using sqlite-rembed and sqlite-vec extensions in ProxySQL.
## Prerequisites
1. **ProxySQL** running with SQLite3 backend enabled (`--sqlite3-server` flag)
2. **Posts table** copied from MySQL to SQLite3 server (248,905 rows)
- Use `scripts/copy_stackexchange_Posts_mysql_to_sqlite3.py` if not already copied
3. **Valid API credentials** for embedding generation
4. **Network access** to embedding API endpoint
## Setup Steps
### Step 1: Create Virtual Vector Table
Create a virtual table for storing 768-dimensional embeddings (matching nomic-embed-text-v1.5 model output):
```sql
-- Create virtual vector table for Posts embeddings
CREATE VIRTUAL TABLE Posts_embeddings USING vec0(
embedding float[768]
);
```
### Step 2: Configure API Client
Configure an embedding API client using the `temp.rembed_clients` virtual table:
```sql
-- Configure embedding API client
-- Replace YOUR_API_KEY with actual API key
INSERT INTO temp.rembed_clients(name, options) VALUES
('posts-embed-client',
rembed_client_options(
'format', 'openai',
'url', 'https://api.synthetic.new/openai/v1/embeddings',
'key', 'YOUR_API_KEY',
'model', 'hf:nomic-ai/nomic-embed-text-v1.5'
)
);
```
### Step 3: Generate and Insert Embeddings
#### For Testing (First 100 rows)
```sql
-- Generate embeddings for first 100 Posts
INSERT OR REPLACE INTO Posts_embeddings(rowid, embedding)
SELECT rowid, rembed('posts-embed-client',
COALESCE(Title || ' ', '') || Body) as embedding
FROM Posts
LIMIT 100;
```
#### For Full Table (Batch Processing)
Use this optimized batch query that processes unembedded rows without requiring rowid tracking:
```sql
-- Batch process unembedded rows (processes ~1000 rows at a time)
INSERT OR REPLACE INTO Posts_embeddings(rowid, embedding)
SELECT Posts.rowid, rembed('posts-embed-client',
COALESCE(Posts.Title || ' ', '') || Posts.Body) as embedding
FROM Posts
LEFT JOIN Posts_embeddings ON Posts.rowid = Posts_embeddings.rowid
WHERE Posts_embeddings.rowid IS NULL
LIMIT 1000;
```
**Key features of this batch query:**
- Uses `LEFT JOIN` to find Posts without existing embeddings
- `WHERE Posts_embeddings.rowid IS NULL` filters for unprocessed rows
- `LIMIT 1000` controls batch size
- Can be run repeatedly until all rows are processed
- No need to track which rowids have been processed
### Step 4: Verify Embeddings
```sql
-- Check total embeddings count
SELECT COUNT(*) as total_embeddings FROM Posts_embeddings;
-- Check embedding size (should be 3072 bytes: 768 dimensions × 4 bytes)
SELECT rowid, length(embedding) as embedding_size_bytes
FROM Posts_embeddings LIMIT 3;
-- Check percentage of Posts with embeddings
SELECT
(SELECT COUNT(*) FROM Posts_embeddings) as with_embeddings,
(SELECT COUNT(*) FROM Posts) as total_posts,
ROUND(
(SELECT COUNT(*) FROM Posts_embeddings) * 100.0 /
(SELECT COUNT(*) FROM Posts), 2
) as percentage_complete;
```
## Batch Processing Strategy for 248,905 Rows
### Recommended Approach
1. **Run the batch query repeatedly** until all rows have embeddings
2. **Add delays between batches** to avoid API rate limiting
3. **Monitor progress** using the verification queries above
### Example Shell Script for Batch Processing
```bash
#!/bin/bash
# process_posts_embeddings.sh
PROXYSQL_HOST="127.0.0.1"
PROXYSQL_PORT="6030"
MYSQL_USER="root"
MYSQL_PASS="root"
BATCH_SIZE=1000
DELAY_SECONDS=5
echo "Starting Posts embeddings generation..."
while true; do
# Execute batch query
mysql -h "$PROXYSQL_HOST" -P "$PROXYSQL_PORT" -u "$MYSQL_USER" -p"$MYSQL_PASS" << EOF
INSERT OR REPLACE INTO Posts_embeddings(rowid, embedding)
SELECT Posts.rowid, rembed('posts-embed-client',
COALESCE(Posts.Title || ' ', '') || Posts.Body) as embedding
FROM Posts
LEFT JOIN Posts_embeddings ON Posts.rowid = Posts_embeddings.rowid
WHERE Posts_embeddings.rowid IS NULL
LIMIT $BATCH_SIZE;
EOF
# Check if any rows were processed
PROCESSED=$(mysql -h "$PROXYSQL_HOST" -P "$PROXYSQL_PORT" -u "$MYSQL_USER" -p"$MYSQL_PASS" -s -N << EOF
SELECT COUNT(*) FROM Posts_embeddings;
EOF)
TOTAL=$(mysql -h "$PROXYSQL_HOST" -P "$PROXYSQL_PORT" -u "$MYSQL_USER" -p"$MYSQL_PASS" -s -N << EOF
SELECT COUNT(*) FROM Posts;
EOF)
PERCENTAGE=$(echo "scale=2; $PROCESSED * 100 / $TOTAL" | bc)
echo "Processed: $PROCESSED/$TOTAL rows ($PERCENTAGE%)"
# Break if all rows processed
if [ "$PROCESSED" -eq "$TOTAL" ]; then
echo "All rows processed!"
break
fi
# Wait before next batch
echo "Waiting $DELAY_SECONDS seconds before next batch..."
sleep $DELAY_SECONDS
done
```
## Similarity Search Examples
Once embeddings are generated, you can perform semantic search:
### Example 1: Find Similar Posts
```sql
-- Find Posts similar to a query about databases
SELECT p.SiteId, p.Id as PostId, p.Title, e.distance,
substr(p.Body, 1, 100) as body_preview
FROM (
SELECT rowid, distance
FROM Posts_embeddings
WHERE embedding MATCH rembed('posts-embed-client',
'database systems and SQL queries')
LIMIT 5
) e
JOIN Posts p ON e.rowid = p.rowid
ORDER BY e.distance;
```
### Example 2: Find Posts Similar to Specific Post
```sql
-- Find Posts similar to Post with ID 1
SELECT p2.SiteId, p2.Id as PostId, p2.Title, e.distance,
substr(p2.Body, 1, 100) as body_preview
FROM (
SELECT rowid, distance
FROM Posts_embeddings
WHERE embedding MATCH (
SELECT embedding
FROM Posts_embeddings
WHERE rowid = 1 -- Change to target Post rowid
)
AND rowid != 1
LIMIT 5
) e
JOIN Posts p2 ON e.rowid = p2.rowid
ORDER BY e.distance;
```
### Example 3: Find Posts About "What is ProxySQL?" with Correct LIMIT Syntax
When using `sqlite-vec`'s `MATCH` operator for similarity search, **you must include a `LIMIT` clause (or `k = ?` constraint) in the same query level as the `MATCH`**. This tells the extension how many nearest neighbors to return.
**Common error**: `ERROR 1045 (28000): A LIMIT or 'k = ?' constraint is required on vec0 knn queries.`
**Correct query**:
```sql
-- Find Posts about "What is ProxySQL?" using semantic similarity
SELECT
p.Id,
p.Title,
SUBSTR(p.Body, 1, 200) AS Excerpt,
e.distance
FROM (
-- LIMIT must be in the subquery that contains MATCH
SELECT rowid, distance
FROM Posts_embeddings
WHERE embedding MATCH rembed('posts-embed-client', 'What is ProxySQL?')
ORDER BY distance ASC
LIMIT 10 -- REQUIRED for vec0 KNN queries
) e
JOIN Posts p ON e.rowid = p.rowid
ORDER BY e.distance ASC;
```
**Alternative using `k = ?` constraint** (instead of `LIMIT`):
```sql
SELECT p.Id, p.Title, e.distance
FROM (
SELECT rowid, distance
FROM Posts_embeddings
WHERE embedding MATCH rembed('posts-embed-client', 'What is ProxySQL?')
AND k = 10 -- Alternative to LIMIT constraint
ORDER BY distance ASC
) e
JOIN Posts p ON e.rowid = p.rowid
ORDER BY e.distance ASC;
```
**Key rules**:
1. `LIMIT` or `k = ?` must be in the same query level as `MATCH`
2. Cannot use both `LIMIT` and `k = ?` together choose one
3. When joining, put `MATCH` + `LIMIT` in a subquery
4. The constraint tells `sqlite-vec` how many similar vectors to return
## Performance Considerations
1. **API Rate Limiting**: The `rembed()` function makes HTTP requests to the API
- Batch size of 1000 with 5-second delays is conservative
- Adjust based on API rate limits
- Monitor API usage and costs
2. **Embedding Storage**:
- Each embedding: 768 dimensions × 4 bytes = 3,072 bytes
- Full table (248,905 rows): ~765 MB
- Ensure sufficient disk space
3. **Search Performance**:
- `vec0` virtual tables use approximate nearest neighbor search
- Performance scales with number of vectors and dimensions
- Use `LIMIT` clauses to control result size
## Troubleshooting
### Common Issues
1. **API Connection Errors**
- Verify API key is valid and has quota
- Check network connectivity to API endpoint
- Confirm API endpoint URL is correct
2. **Embedding Generation Failures**
- Check `temp.rembed_clients` configuration
- Verify client name matches in `rembed()` calls
- Test with simple text first: `SELECT rembed('posts-embed-client', 'test');`
3. **Batch Processing Stalls**
- Check if API rate limits are being hit
- Increase delay between batches
- Reduce batch size
4. **Memory Issues**
- Large batches may consume significant memory
- Reduce batch size if encountering memory errors
- Monitor ProxySQL memory usage
### Verification Queries
```sql
-- Check API client configuration
SELECT name, json_extract(options, '$.format') as format,
json_extract(options, '$.model') as model
FROM temp.rembed_clients;
-- Test embedding generation
SELECT length(rembed('posts-embed-client', 'test text')) as test_embedding_size;
-- Check for embedding generation errors
SELECT rowid FROM Posts_embeddings WHERE length(embedding) != 3072;
```
## Maintenance
### Adding New Posts
When new Posts are added to the table:
```sql
-- Generate embeddings for new Posts
INSERT OR REPLACE INTO Posts_embeddings(rowid, embedding)
SELECT Posts.rowid, rembed('posts-embed-client',
COALESCE(Posts.Title || ' ', '') || Posts.Body) as embedding
FROM Posts
LEFT JOIN Posts_embeddings ON Posts.rowid = Posts_embeddings.rowid
WHERE Posts_embeddings.rowid IS NULL;
```
### Recreating Virtual Table
If you need to recreate the virtual table:
```sql
-- Drop existing table
DROP TABLE IF EXISTS Posts_embeddings;
-- Recreate with same schema
CREATE VIRTUAL TABLE Posts_embeddings USING vec0(
embedding float[768]
);
```
## Related Resources
1. [sqlite-rembed Integration Documentation](./sqlite-rembed-integration.md)
2. [SQLite3 Server Documentation](./SQLite3-Server.md)
3. [Vector Search Testing](../doc/vector-search-test/README.md)
4. [Copy Script](../scripts/copy_stackexchange_Posts_mysql_to_sqlite3.py)
---
*Last Updated: $(date)*