Vector Features - Embedding-Based Similarity for ProxySQL
Overview
Vector Features provide semantic similarity capabilities for ProxySQL using vector embeddings and sqlite-vec for efficient similarity search. This enables:
- NL2SQL Vector Cache: Cache natural language queries by semantic meaning, not just exact text
- Anomaly Detection: Detect SQL threats using embedding similarity against known attack patterns
Features
| Feature | Description | Benefit |
|---|---|---|
| Semantic Caching | Cache queries by meaning, not exact text | Higher cache hit rates for similar queries |
| Threat Detection | Detect attacks using embedding similarity | Catch variations of known attack patterns |
| Vector Storage | sqlite-vec for efficient KNN search | Fast similarity queries on embedded vectors |
| GenAI Integration | Uses existing GenAI module for embeddings | No external embedding service required |
| Configurable Thresholds | Adjust similarity sensitivity | Balance between false positives and negatives |
Architecture
Query Input
|
v
+-----------------+
| GenAI Module | -> Generate 1536-dim embedding
| (llama-server) |
+-----------------+
|
v
+-----------------+
| Vector DB | -> Store embedding in SQLite
| (sqlite-vec) | -> Similarity search via KNN
+-----------------+
|
v
+-----------------+
| Result | -> Similar items within threshold
+-----------------+
Quick Start
1. Enable AI Features
-- Via admin interface
SET ai_features_enabled='true';
LOAD MYSQL VARIABLES TO RUNTIME;
2. Configure Vector Database
-- Set vector DB path (default: /var/lib/proxysql/ai_features.db)
SET ai_vector_db_path='/var/lib/proxysql/ai_features.db';
-- Set vector dimension (default: 1536 for text-embedding-3-small)
SET ai_vector_dimension='1536';
3. Configure NL2SQL Vector Cache
-- Enable NL2SQL
SET ai_nl2sql_enabled='true';
-- Set cache similarity threshold (0-100, default: 85)
SET ai_nl2sql_cache_similarity_threshold='85';
4. Configure Anomaly Detection
-- Enable anomaly detection
SET ai_anomaly_detection_enabled='true';
-- Set similarity threshold (0-100, default: 85)
SET ai_anomaly_similarity_threshold='85';
-- Set risk threshold (0-100, default: 70)
SET ai_anomaly_risk_threshold='70';
NL2SQL Vector Cache
How It Works
- User submits an NL2SQL query: NL2SQL: Show all customers
- Generate embedding: Query text → 1536-dimensional vector
- Search cache: Find semantically similar cached queries
- Return cached SQL if similarity > threshold
- Otherwise call LLM and store result in cache
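The steps above amount to a lookup-then-fallback flow. The following is a minimal sketch of that control flow only, assuming three caller-supplied callbacks. `CacheMatch`, `Nl2SqlHooks`, and `resolve_nl2sql` are hypothetical names for illustration and are not part of ProxySQL.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical nearest-neighbour match from the vector cache.
struct CacheMatch {
    std::string sql;    // cached SQL text
    double similarity;  // 0-100, higher means closer
};

// Placeholder callbacks for illustration; none of these are ProxySQL APIs.
struct Nl2SqlHooks {
    // Embeds the question and searches the cache; returns false if the cache is empty.
    std::function<bool(const std::string&, CacheMatch&)> find_nearest;
    // Asks the LLM to generate SQL for the question.
    std::function<std::string(const std::string&)> call_llm;
    // Stores the question and its generated SQL in the cache.
    std::function<void(const std::string&, const std::string&)> store;
};

// Returns cached SQL when the best match is above the threshold,
// otherwise calls the LLM and caches its answer.
std::string resolve_nl2sql(const std::string& question, double threshold,
                           const Nl2SqlHooks& hooks) {
    assert(hooks.find_nearest && hooks.call_llm && hooks.store);
    CacheMatch match{"", 0.0};
    if (hooks.find_nearest(question, match) && match.similarity > threshold) {
        return match.sql;  // cache hit: no LLM call
    }
    std::string sql = hooks.call_llm(question);  // cache miss
    hooks.store(question, sql);
    return sql;
}
```

Passing the cache and the LLM as callbacks keeps the sketch independent of any particular storage or model backend.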
Configuration Variables
| Variable | Default | Description |
|---|---|---|
| ai_nl2sql_enabled | true | Enable/disable NL2SQL |
| ai_nl2sql_cache_similarity_threshold | 85 | Semantic similarity threshold (0-100) |
| ai_nl2sql_timeout_ms | 30000 | LLM request timeout |
| ai_vector_db_path | /var/lib/proxysql/ai_features.db | Vector database file path |
| ai_vector_dimension | 1536 | Embedding dimension |
Example: Semantic Cache Hit
-- First query - calls LLM
NL2SQL: Show me all customers from USA;
-- Similar query - returns cached result (no LLM call!)
NL2SQL: Display customers in the United States;
-- Another similar query - cached
NL2SQL: List USA customers;
All three queries are semantically similar and will hit the cache after the first one.
Cache Statistics
-- View cache statistics
SHOW STATUS LIKE 'ai_nl2sql_cache_%';
Anomaly Detection
How It Works
- Query intercepted during session processing
- Generate embedding of normalized query
- KNN search against threat pattern embeddings
- Calculate risk score: (severity / 10) * (1 - distance / 2)
- Block or flag if risk > threshold
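The risk formula above can be sketched as a small function. This is an illustration under assumptions: the score is scaled to 0-100 so it compares directly with the risk threshold, and out-of-range distances are clamped. `risk_score` and `exceeds_risk_threshold` are hypothetical names, not ProxySQL functions.

```cpp
#include <algorithm>
#include <cassert>

// Risk score on a 0-100 scale: (severity / 10) * (1 - distance / 2) * 100.
// severity is 1-10; distance is a cosine distance in [0, 2].
double risk_score(int severity, double distance) {
    assert(severity >= 1 && severity <= 10);
    // Clamp the distance so the score stays within 0-100 (an assumption of this sketch).
    double d = std::min(2.0, std::max(0.0, distance));
    return (severity / 10.0) * (1.0 - d / 2.0) * 100.0;
}

// True when the score is strictly above the configured threshold (0-100).
bool exceeds_risk_threshold(int severity, double distance, int risk_threshold) {
    return risk_score(severity, distance) > risk_threshold;
}
```

For instance, a severity-8 pattern matched at distance 0.5 scores 0.8 * 0.75 = 0.60, or 60 on the 0-100 scale, which stays below the default threshold of 70.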
Configuration Variables
| Variable | Default | Description |
|---|---|---|
| ai_anomaly_detection_enabled | true | Enable/disable anomaly detection |
| ai_anomaly_similarity_threshold | 85 | Similarity threshold for threat matching (0-100) |
| ai_anomaly_risk_threshold | 70 | Risk score threshold for blocking (0-100) |
| ai_anomaly_rate_limit | 100 | Max anomalies per minute before rate limiting |
| ai_anomaly_auto_block | true | Automatically block high-risk queries |
| ai_anomaly_log_only | false | If true, log but don't block |
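One way to read how ai_anomaly_risk_threshold, ai_anomaly_auto_block, and ai_anomaly_log_only combine is sketched below. The precedence shown (log-only wins over auto-block) is an assumption made for illustration and is not confirmed by this document. `Action`, `AnomalyConfig`, and `decide` are hypothetical names.

```cpp
#include <cassert>

enum class Action { Allow, Flag, Block };

// Hypothetical mirror of the three variables, using the documented defaults.
struct AnomalyConfig {
    int risk_threshold = 70;  // ai_anomaly_risk_threshold
    bool auto_block = true;   // ai_anomaly_auto_block
    bool log_only = false;    // ai_anomaly_log_only
};

// Assumed precedence: below threshold -> allow; log_only or no auto_block -> flag;
// otherwise block. risk is on the same 0-100 scale as the threshold.
Action decide(double risk, const AnomalyConfig& cfg) {
    assert(cfg.risk_threshold >= 0 && cfg.risk_threshold <= 100);
    if (risk <= cfg.risk_threshold) return Action::Allow;
    if (cfg.log_only || !cfg.auto_block) return Action::Flag;
    return Action::Block;
}
```

Rate limiting (ai_anomaly_rate_limit) is left out of the sketch because its interaction with blocking is not described here.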
Threat Pattern Management
Add a Threat Pattern
Via C++ API:
anomaly_detector->add_threat_pattern(
"OR 1=1 Tautology",
"SELECT * FROM users WHERE username='admin' OR 1=1--'",
"sql_injection",
9 // severity 1-10
);
Via MCP (future):
{
"jsonrpc": "2.0",
"method": "tools/call",
"params": {
"name": "ai_add_threat_pattern",
"arguments": {
"pattern_name": "OR 1=1 Tautology",
"query_example": "SELECT * FROM users WHERE username='admin' OR 1=1--'",
"pattern_type": "sql_injection",
"severity": 9
}
}
}
List Threat Patterns
std::string patterns = anomaly_detector->list_threat_patterns();
// Returns JSON array of all patterns
Remove a Threat Pattern
bool success = anomaly_detector->remove_threat_pattern(pattern_id);
Built-in Threat Patterns
See scripts/add_threat_patterns.sh for 10 example threat patterns:
| Pattern | Type | Severity |
|---|---|---|
| OR 1=1 Tautology | sql_injection | 9 |
| UNION SELECT | sql_injection | 8 |
| Comment Injection | sql_injection | 7 |
| Sleep-based DoS | dos | 6 |
| Benchmark-based DoS | dos | 6 |
| INTO OUTFILE | data_exfiltration | 9 |
| DROP TABLE | privilege_escalation | 10 |
| Schema Probing | reconnaissance | 3 |
| CONCAT Injection | sql_injection | 8 |
| Hex Encoding | sql_injection | 7 |
Detection Example
-- Known threat pattern in database:
-- "SELECT * FROM users WHERE id=1 OR 1=1--"
-- Attacker tries variation:
SELECT * FROM users WHERE id=5 OR 2=2--';
-- Embedding similarity detects this as similar to OR 1=1 pattern
-- Risk score: (9/10) * (1 - 0.15/2) = 0.9 * 0.925 = 0.83 (83% risk)
-- Since 83 > 70 (risk_threshold), query is BLOCKED
Anomaly Statistics
-- View anomaly statistics
SHOW STATUS LIKE 'ai_anomaly_%';
-- ai_detected_anomalies
-- ai_blocked_queries
-- ai_flagged_queries
Via API:
std::string stats = anomaly_detector->get_statistics();
// Returns JSON with detailed statistics
Vector Database
Schema
The vector database (ai_features.db) contains:
Main Tables
nl2sql_cache
CREATE TABLE nl2sql_cache (
id INTEGER PRIMARY KEY AUTOINCREMENT,
natural_language TEXT NOT NULL,
generated_sql TEXT NOT NULL,
schema_context TEXT,
embedding BLOB,
hit_count INTEGER DEFAULT 0,
last_hit INTEGER,
created_at INTEGER
);
anomaly_patterns
CREATE TABLE anomaly_patterns (
id INTEGER PRIMARY KEY AUTOINCREMENT,
pattern_name TEXT,
pattern_type TEXT, -- 'sql_injection', 'dos', 'privilege_escalation'
query_example TEXT,
embedding BLOB,
severity INTEGER, -- 1-10
created_at INTEGER
);
query_history
CREATE TABLE query_history (
id INTEGER PRIMARY KEY AUTOINCREMENT,
query_text TEXT NOT NULL,
generated_sql TEXT,
embedding BLOB,
execution_time_ms INTEGER,
success BOOLEAN,
timestamp INTEGER
);
Virtual Vector Tables (sqlite-vec)
CREATE VIRTUAL TABLE nl2sql_cache_vec USING vec0(
embedding float(1536)
);
CREATE VIRTUAL TABLE anomaly_patterns_vec USING vec0(
embedding float(1536)
);
CREATE VIRTUAL TABLE query_history_vec USING vec0(
embedding float(1536)
);
Similarity Search Algorithm
Cosine Distance is used for similarity measurement:
distance = 1 - cosine_similarity
where:
cosine_similarity = (A . B) / (|A| * |B|)
Distance range: 0 (identical) to 2 (opposite)
Similarity = (2 - distance) / 2 * 100
Threshold Conversion:
similarity_threshold (0-100) → distance_threshold (0-2)
distance_threshold = 2.0 - (similarity_threshold / 50.0)
Example:
similarity = 85 → distance = 2.0 - (85/50.0) = 0.3
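The threshold conversion can be written as a pair of inverse helpers. This is a minimal sketch of the two formulas above. The function names are hypothetical and only the arithmetic comes from this document.

```cpp
#include <cassert>

// similarity threshold (0-100) -> cosine distance threshold (0-2)
double similarity_to_distance(double similarity) {
    assert(similarity >= 0.0 && similarity <= 100.0);
    return 2.0 - similarity / 50.0;
}

// cosine distance (0-2) -> similarity (0-100); inverse of the function above
double distance_to_similarity(double distance) {
    assert(distance >= 0.0 && distance <= 2.0);
    return (2.0 - distance) / 2.0 * 100.0;
}
```

With the default threshold of 85 this gives a distance cutoff of about 0.3. A stricter threshold of 95 would give 2.0 - 95/50.0 = 0.1.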
KNN Search Example
-- Find similar cached queries
SELECT c.natural_language, c.generated_sql,
vec_distance_cosine(v.embedding, '[0.1, 0.2, ...]') as distance
FROM nl2sql_cache c
JOIN nl2sql_cache_vec v ON c.id = v.rowid
WHERE v.embedding MATCH '[0.1, 0.2, ...]'
AND distance < 0.3
ORDER BY distance
LIMIT 1;
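To make the intent of this query concrete, here is a brute-force equivalent over in-memory vectors. It is a sketch of the semantics only, not of how sqlite-vec executes the search. It takes cosine distance as 1 - cosine similarity (range 0 to 2) and assumes non-zero vectors of equal length. `cosine_distance` and `nearest_within` are hypothetical helper names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Cosine distance as 1 - cosine similarity: 0 (same direction) to 2 (opposite).
// Assumes both vectors are non-zero and have the same length.
double cosine_distance(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size() && !a.empty());
    double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot    += static_cast<double>(a[i]) * b[i];
        norm_a += static_cast<double>(a[i]) * a[i];
        norm_b += static_cast<double>(b[i]) * b[i];
    }
    return 1.0 - dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
}

// Index of the closest row whose distance is below max_distance, or -1 if none.
// Mirrors "distance < 0.3 ORDER BY distance LIMIT 1" with a linear scan.
int nearest_within(const std::vector<std::vector<float>>& rows,
                   const std::vector<float>& query, double max_distance) {
    int best = -1;
    double best_distance = max_distance;
    for (std::size_t i = 0; i < rows.size(); ++i) {
        double d = cosine_distance(rows[i], query);
        if (d < best_distance) {
            best_distance = d;
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

A linear scan like this is only practical for small tables. It is shown to clarify what the distance filter and ordering mean.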
GenAI Integration
Vector Features use the existing GenAI Module for embedding generation.
Embedding Endpoint
- Module: lib/GenAI_Thread.cpp
- Global Handler: GenAI_Threads_Handler *GloGATH
- Method: embed_documents({text})
- Returns: GenAI_EmbeddingResult with float* data, embedding_size, count
Configuration
GenAI module connects to llama-server for embeddings:
// Endpoint: http://127.0.0.1:8013/embedding
// Model: nomic-embed-text-v1.5 (or similar)
// Dimension: 1536
Memory Management
// GenAI returns malloc'd data - must free after copying
GenAI_EmbeddingResult result = GloGATH->embed_documents({text});
std::vector<float> embedding(result.data, result.data + result.embedding_size);
free(result.data); // Important: free the original data
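The copy-then-free pattern can be made harder to get wrong by handing the buffer to a smart pointer before copying, so it is released even if the copy throws. The sketch below uses a hypothetical stand-in struct, `EmbeddingResultSketch`, that only mirrors the fields named above. It is not the real GenAI_EmbeddingResult, and `take_embedding` is an illustrative helper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <vector>

// Hypothetical stand-in with the fields this document names; not the real type.
struct EmbeddingResultSketch {
    float* data;                 // malloc'd buffer of count * embedding_size floats
    std::size_t embedding_size;  // floats per embedding
    std::size_t count;           // number of embeddings in the buffer
};

// Deleter that releases malloc'd memory with free().
struct FreeDeleter {
    void operator()(float* p) const { std::free(p); }
};

// Copies the first embedding into a vector and frees the original buffer.
// The result's data pointer is reset so the caller cannot free it twice.
std::vector<float> take_embedding(EmbeddingResultSketch& result) {
    assert(result.data == nullptr || result.count >= 1);
    std::unique_ptr<float, FreeDeleter> owner(result.data);
    result.data = nullptr;
    if (!owner || result.embedding_size == 0) return {};
    return std::vector<float>(owner.get(), owner.get() + result.embedding_size);
}
```

The buffer is freed when `owner` goes out of scope, whether the function returns normally or the vector allocation throws.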
Performance
Embedding Generation
| Operation | Time | Notes |
|---|---|---|
| Generate embedding | ~100-300ms | Via llama-server (local) |
| Vector cache search | ~10-50ms | KNN search with sqlite-vec |
| Pattern similarity check | ~10-50ms | KNN search with sqlite-vec |
Cache Benefits
- Cache hit: ~10-50ms for the vector lookup, plus embedding generation (~100-300ms), vs 1-5s for an LLM call
- Semantic matching: Higher hit rate than exact text cache
- Reduced LLM costs: Fewer API calls to cloud providers
Storage
- Embedding size: 1536 floats × 4 bytes = ~6 KB per query
- 1000 cached queries: ~6 MB + overhead
- 100 threat patterns: ~600 KB
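These estimates follow from rows × dimension × 4 bytes. The helper below is a small sketch for sizing other configurations. `embedding_bytes` is a hypothetical name, and it counts raw float storage only, ignoring SQLite and sqlite-vec overhead.

```cpp
#include <cassert>
#include <cstdint>

static_assert(sizeof(float) == 4, "estimate assumes 4-byte floats");

// Raw bytes needed to store `rows` embeddings of `dimension` floats each.
std::uint64_t embedding_bytes(std::uint64_t rows, std::uint64_t dimension) {
    assert(dimension > 0);
    return rows * dimension * sizeof(float);
}
```

For example, 1000 rows at dimension 1536 come to 6,144,000 bytes. A 768-dimension model would need half of that.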
Troubleshooting
Vector Features Not Working
- Check AI features enabled:
  SELECT * FROM runtime_global_variables WHERE variable_name LIKE 'ai_%_enabled';
- Check vector DB exists:
  ls -la /var/lib/proxysql/ai_features.db
- Check GenAI handler initialized:
  tail -f proxysql.log | grep GenAI
- Check llama-server running:
  curl http://127.0.0.1:8013/embedding
Poor Similarity Detection
- Adjust thresholds:
  -- Lower threshold = more sensitive (more false positives)
  SET ai_anomaly_similarity_threshold='80';
- Add more threat patterns:
  anomaly_detector->add_threat_pattern(...);
- Check embedding quality:
  - Ensure llama-server is using a good embedding model
  - Verify query normalization is working
Cache Issues
-- Clear cache (via API, not SQL yet)
anomaly_detector->clear_cache();
-- Check cache statistics
SHOW STATUS LIKE 'ai_nl2sql_cache_%';
Security Considerations
- Embeddings are stored locally in SQLite database
- No external API calls for similarity search
- Threat patterns are user-defined - ensure proper access control
- Risk scores are heuristic - tune thresholds for your environment
Future Enhancements
- Automatic threat pattern learning from flagged queries
- Embedding model fine-tuning for SQL domain
- Distributed vector storage for large-scale deployments
- Real-time embedding updates for adaptive learning
- Multi-lingual support for embeddings
API Reference
See API.md for complete API documentation.
Architecture Details
See ARCHITECTURE.md for detailed architecture documentation.
Testing Guide
See TESTING.md for testing instructions.