# Vector Features - Embedding-Based Similarity for ProxySQL
## Overview
Vector Features provide **semantic similarity** capabilities for ProxySQL using **vector embeddings** and **sqlite-vec** for efficient similarity search. This enables:
- **NL2SQL Vector Cache**: Cache natural language queries by semantic meaning, not just exact text
- **Anomaly Detection**: Detect SQL threats using embedding similarity against known attack patterns
## Features
| Feature | Description | Benefit |
|---------|-------------|---------|
| **Semantic Caching** | Cache queries by meaning, not exact text | Higher cache hit rates for similar queries |
| **Threat Detection** | Detect attacks using embedding similarity | Catch variations of known attack patterns |
| **Vector Storage** | sqlite-vec for efficient KNN search | Fast similarity queries on embedded vectors |
| **GenAI Integration** | Uses existing GenAI module for embeddings | No external embedding service required |
| **Configurable Thresholds** | Adjust similarity sensitivity | Balance between false positives and negatives |
## Architecture
```
Query Input
     |
     v
+-----------------+
| GenAI Module    | -> Generate 1536-dim embedding
| (llama-server)  |
+-----------------+
         |
         v
+-----------------+
| Vector DB       | -> Store embedding in SQLite
| (sqlite-vec)    | -> Similarity search via KNN
+-----------------+
         |
         v
+-----------------+
| Result          | -> Similar items within threshold
+-----------------+
```
## Quick Start
### 1. Enable AI Features
```sql
-- Via admin interface
SET ai_features_enabled='true';
LOAD MYSQL VARIABLES TO RUNTIME;
```
### 2. Configure Vector Database
```sql
-- Set vector DB path (default: /var/lib/proxysql/ai_features.db)
SET ai_vector_db_path='/var/lib/proxysql/ai_features.db';
-- Set vector dimension (default: 1536 for text-embedding-3-small)
SET ai_vector_dimension='1536';
```
### 3. Configure NL2SQL Vector Cache
```sql
-- Enable NL2SQL
SET ai_nl2sql_enabled='true';
-- Set cache similarity threshold (0-100, default: 85)
SET ai_nl2sql_cache_similarity_threshold='85';
```
### 4. Configure Anomaly Detection
```sql
-- Enable anomaly detection
SET ai_anomaly_detection_enabled='true';
-- Set similarity threshold (0-100, default: 85)
SET ai_anomaly_similarity_threshold='85';
-- Set risk threshold (0-100, default: 70)
SET ai_anomaly_risk_threshold='70';
```
## NL2SQL Vector Cache
### How It Works
1. **User submits NL2SQL query**: `NL2SQL: Show all customers`
2. **Generate embedding**: Query text → 1536-dimensional vector
3. **Search cache**: Find semantically similar cached queries
4. **Return cached SQL** if similarity > threshold
5. **Otherwise call LLM** and store result in cache
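
The steps above can be sketched in a few lines of C++. This is a minimal in-memory illustration, not the ProxySQL implementation: `CacheEntry`, `SemanticCache`, `similarity_percent`, and `resolve` are hypothetical names, the cache is a plain vector rather than sqlite-vec, and similarity is assumed to be cosine similarity mapped onto the 0-100 scale as `(1 + cosine) / 2 * 100`.

```cpp
#include <cmath>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory stand-in for one row of the vector cache.
struct CacheEntry {
    std::vector<float> embedding;
    std::string sql;
};

// Cosine similarity mapped onto the 0-100 scale (assumption: 100 = identical, 0 = opposite).
inline double similarity_percent(const std::vector<float>& a, const std::vector<float>& b) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) {
        dot += static_cast<double>(a[i]) * b[i];
        na += static_cast<double>(a[i]) * a[i];
        nb += static_cast<double>(b[i]) * b[i];
    }
    if (na == 0.0 || nb == 0.0) return 0.0;  // zero vectors never match
    return (1.0 + dot / (std::sqrt(na) * std::sqrt(nb))) / 2.0 * 100.0;
}

class SemanticCache {
public:
    explicit SemanticCache(double threshold_percent) : threshold_(threshold_percent) {}

    // Step 3-4: find the most similar entry; report a hit only above the threshold.
    bool lookup(const std::vector<float>& query, std::string& sql_out) const {
        const CacheEntry* best = nullptr;
        double best_sim = -1.0;
        for (const CacheEntry& e : entries_) {
            const double s = similarity_percent(query, e.embedding);
            if (s > best_sim) { best_sim = s; best = &e; }
        }
        if (best == nullptr || best_sim <= threshold_) return false;
        sql_out = best->sql;
        return true;
    }

    void store(std::vector<float> embedding, std::string sql) {
        entries_.push_back(CacheEntry{std::move(embedding), std::move(sql)});
    }

private:
    double threshold_;
    std::vector<CacheEntry> entries_;
};

// Step 5: on a miss, call the LLM (injected as a callback) and cache its answer.
inline std::string resolve(SemanticCache& cache, const std::vector<float>& embedding,
                           const std::function<std::string()>& call_llm) {
    std::string sql;
    if (cache.lookup(embedding, sql)) return sql;
    sql = call_llm();
    cache.store(embedding, sql);
    return sql;
}
```

Passing the LLM call in as a callback keeps the sketch testable without a model; a real deployment would obtain the embedding from the GenAI module first.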
### Configuration Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `ai_nl2sql_enabled` | true | Enable/disable NL2SQL |
| `ai_nl2sql_cache_similarity_threshold` | 85 | Semantic similarity threshold (0-100) |
| `ai_nl2sql_timeout_ms` | 30000 | LLM request timeout |
| `ai_vector_db_path` | /var/lib/proxysql/ai_features.db | Vector database file path |
| `ai_vector_dimension` | 1536 | Embedding dimension |
### Example: Semantic Cache Hit
```sql
-- First query - calls LLM
NL2SQL: Show me all customers from USA;
-- Similar query - returns cached result (no LLM call!)
NL2SQL: Display customers in the United States;
-- Another similar query - cached
NL2SQL: List USA customers;
```
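All three queries are **semantically similar**, so once the first result is cached, the second and third should be served from the cache, provided their similarity to the cached entry exceeds `ai_nl2sql_cache_similarity_threshold`.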
### Cache Statistics
```sql
-- View cache statistics
SHOW STATUS LIKE 'ai_nl2sql_cache_%';
```
## Anomaly Detection
### How It Works
1. **Query intercepted** during session processing
2. **Generate embedding** of normalized query
3. **KNN search** against threat pattern embeddings
4. **Calculate risk score**: `(severity / 10) * (1 - distance / 2)`
5. **Block or flag** if risk > threshold
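
The scoring and decision steps can be sketched as follows. The formula is the one given in step 4, scaled to 0-100 so it can be compared with `ai_anomaly_risk_threshold`. The `Verdict` enum and the `decide` helper are hypothetical, and the way `auto_block` and `log_only` interact here is an assumption made for illustration, not confirmed behavior.

```cpp
#include <algorithm>

// Risk on a 0-100 scale: (severity / 10) * (1 - distance / 2) * 100.
// severity is clamped to 1-10, distance (cosine distance) to 0-2.
inline double risk_score(int severity, double distance) {
    const double s = std::min(std::max(severity, 1), 10) / 10.0;
    const double d = std::min(std::max(distance, 0.0), 2.0);
    return s * (1.0 - d / 2.0) * 100.0;
}

enum class Verdict { Allow, Flag, Block };

// Hypothetical decision helper (assumed semantics):
// - at or below the threshold the query passes;
// - above it, the query is blocked only when auto_block is on and log_only is off;
// - otherwise it is merely flagged.
inline Verdict decide(double risk, double risk_threshold, bool auto_block, bool log_only) {
    if (risk <= risk_threshold) return Verdict::Allow;
    if (log_only || !auto_block) return Verdict::Flag;
    return Verdict::Block;
}
```

Because the score is a product, a low-severity pattern can never exceed the threshold on its own: severity 3 caps the risk at 30 even for an exact match.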
### Configuration Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `ai_anomaly_detection_enabled` | true | Enable/disable anomaly detection |
| `ai_anomaly_similarity_threshold` | 85 | Similarity threshold for threat matching (0-100) |
| `ai_anomaly_risk_threshold` | 70 | Risk score threshold for blocking (0-100) |
| `ai_anomaly_rate_limit` | 100 | Max anomalies per minute before rate limiting |
| `ai_anomaly_auto_block` | true | Automatically block high-risk queries |
| `ai_anomaly_log_only` | false | If true, log but don't block |
### Threat Pattern Management
#### Add a Threat Pattern
Via C++ API:
```cpp
anomaly_detector->add_threat_pattern(
    "OR 1=1 Tautology",
    "SELECT * FROM users WHERE username='admin' OR 1=1--'",
    "sql_injection",
    9  // severity 1-10
);
```
Via MCP (future):
```json
{
  "jsonrpc": "2.0",
  "method": "tools/call",
  "params": {
    "name": "ai_add_threat_pattern",
    "arguments": {
      "pattern_name": "OR 1=1 Tautology",
      "query_example": "SELECT * FROM users WHERE username='admin' OR 1=1--'",
      "pattern_type": "sql_injection",
      "severity": 9
    }
  }
}
```
#### List Threat Patterns
```cpp
std::string patterns = anomaly_detector->list_threat_patterns();
// Returns JSON array of all patterns
```
#### Remove a Threat Pattern
```cpp
bool success = anomaly_detector->remove_threat_pattern(pattern_id);
```
### Built-in Threat Patterns
See `scripts/add_threat_patterns.sh` for 10 example threat patterns:
| Pattern | Type | Severity |
|---------|------|----------|
| OR 1=1 Tautology | sql_injection | 9 |
| UNION SELECT | sql_injection | 8 |
| Comment Injection | sql_injection | 7 |
| Sleep-based DoS | dos | 6 |
| Benchmark-based DoS | dos | 6 |
| INTO OUTFILE | data_exfiltration | 9 |
| DROP TABLE | privilege_escalation | 10 |
| Schema Probing | reconnaissance | 3 |
| CONCAT Injection | sql_injection | 8 |
| Hex Encoding | sql_injection | 7 |
### Detection Example
```sql
-- Known threat pattern in database:
-- "SELECT * FROM users WHERE id=1 OR 1=1--"
-- Attacker tries variation:
SELECT * FROM users WHERE id=5 OR 2=2--';
-- Embedding similarity detects this as similar to OR 1=1 pattern
-- Risk score: (9/10) * (1 - 0.15/2) = 0.83 (83% risk)
-- Since 83 > 70 (risk_threshold), query is BLOCKED
```
### Anomaly Statistics
```sql
-- View anomaly statistics
SHOW STATUS LIKE 'ai_anomaly_%';
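-- Counters of interest. These names lack the ai_anomaly_ prefix, so the
-- pattern above may not match them; a broader pattern such as 'ai_%' may be needed:
-- ai_detected_anomalies
-- ai_blocked_queries
-- ai_flagged_queries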
```
Via API:
```cpp
std::string stats = anomaly_detector->get_statistics();
// Returns JSON with detailed statistics
```
## Vector Database
### Schema
The vector database (`ai_features.db`) contains:
#### Main Tables
**nl2sql_cache**
```sql
CREATE TABLE nl2sql_cache (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    natural_language TEXT NOT NULL,
    generated_sql TEXT NOT NULL,
    schema_context TEXT,
    embedding BLOB,
    hit_count INTEGER DEFAULT 0,
    last_hit INTEGER,
    created_at INTEGER
);
```
**anomaly_patterns**
```sql
CREATE TABLE anomaly_patterns (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    pattern_name TEXT,
    pattern_type TEXT,   -- e.g. 'sql_injection', 'dos', 'privilege_escalation', 'data_exfiltration', 'reconnaissance'
    query_example TEXT,
    embedding BLOB,
    severity INTEGER,    -- 1-10
    created_at INTEGER
);
```
**query_history**
```sql
CREATE TABLE query_history (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    query_text TEXT NOT NULL,
    generated_sql TEXT,
    embedding BLOB,
    execution_time_ms INTEGER,
    success BOOLEAN,
    timestamp INTEGER
);
```
#### Virtual Vector Tables (sqlite-vec)
```sql
-- sqlite-vec declares vector columns with square brackets: float[N]
CREATE VIRTUAL TABLE nl2sql_cache_vec USING vec0(
    embedding float[1536]
);
CREATE VIRTUAL TABLE anomaly_patterns_vec USING vec0(
    embedding float[1536]
);
CREATE VIRTUAL TABLE query_history_vec USING vec0(
    embedding float[1536]
);
```
### Similarity Search Algorithm
**Cosine Distance** is used for similarity measurement:
```
distance = 1 - cosine_similarity
where:
  cosine_similarity = (A . B) / (|A| * |B|)
Distance range: 0 (identical) to 2 (opposite)
Similarity = (2 - distance) / 2 * 100
```
**Threshold Conversion**:
```
similarity_threshold (0-100) → distance_threshold (0-2)
distance_threshold = 2.0 - (similarity_threshold / 50.0)
Example:
similarity = 85 → distance = 2.0 - (85/50.0) = 0.3
```
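
The conversion between the 0-100 similarity scale and the 0-2 distance scale can be written as a pair of inverse helpers. The function names are illustrative only; the arithmetic is taken directly from the formulas above.

```cpp
// Convert a similarity threshold (0-100) to a cosine-distance threshold (0-2).
inline double similarity_to_distance(double similarity_percent) {
    return 2.0 - similarity_percent / 50.0;
}

// Inverse mapping: cosine distance (0-2) back to similarity (0-100).
inline double distance_to_similarity(double distance) {
    return (2.0 - distance) / 2.0 * 100.0;
}
```

With these helpers the default threshold of 85 maps to a distance cut-off of 0.3, and lowering the threshold to 80 widens the cut-off to 0.4.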
### KNN Search Example
```sql
-- Find similar cached queries
SELECT c.natural_language, c.generated_sql,
vec_distance_cosine(v.embedding, '[0.1, 0.2, ...]') as distance
FROM nl2sql_cache c
JOIN nl2sql_cache_vec v ON c.id = v.rowid
WHERE v.embedding MATCH '[0.1, 0.2, ...]'
AND distance < 0.3
ORDER BY distance
LIMIT 1;
```
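
To make clear what such a query computes, here is a brute-force C++ equivalent over an in-memory set of rows. It is a sketch for illustration only: sqlite-vec performs the search inside SQLite, the `cosine_distance` and `knn` names are hypothetical, and cosine distance is assumed to be `1 - cosine_similarity`.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Cosine distance, taken here as 1 - cosine_similarity (0 = identical, 2 = opposite).
inline double cosine_distance(const std::vector<float>& a, const std::vector<float>& b) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    const std::size_t n = std::min(a.size(), b.size());
    for (std::size_t i = 0; i < n; ++i) {
        dot += static_cast<double>(a[i]) * b[i];
        na += static_cast<double>(a[i]) * a[i];
        nb += static_cast<double>(b[i]) * b[i];
    }
    if (na == 0.0 || nb == 0.0) return 1.0;  // treat zero vectors as unrelated
    return 1.0 - dot / (std::sqrt(na) * std::sqrt(nb));
}

// Returns up to k (row index, distance) pairs with distance < max_distance,
// nearest first. Mirrors "distance < 0.3 ORDER BY distance LIMIT k".
inline std::vector<std::pair<std::size_t, double>> knn(
        const std::vector<std::vector<float>>& rows,
        const std::vector<float>& query,
        std::size_t k, double max_distance) {
    std::vector<std::pair<std::size_t, double>> hits;
    for (std::size_t i = 0; i < rows.size(); ++i) {
        const double d = cosine_distance(rows[i], query);
        if (d < max_distance) hits.emplace_back(i, d);
    }
    std::sort(hits.begin(), hits.end(),
              [](const std::pair<std::size_t, double>& x,
                 const std::pair<std::size_t, double>& y) { return x.second < y.second; });
    if (hits.size() > k) hits.resize(k);
    return hits;
}
```

A linear scan like this costs one distance computation per stored row, which is acceptable for a few thousand rows but is the reason a vector index is used for larger tables.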
## GenAI Integration
Vector Features use the existing **GenAI Module** for embedding generation.
### Embedding Endpoint
- **Module**: `lib/GenAI_Thread.cpp`
- **Global Handler**: `GenAI_Threads_Handler *GloGATH`
- **Method**: `embed_documents({text})`
- **Returns**: `GenAI_EmbeddingResult` with `float* data`, `embedding_size`, `count`
### Configuration
GenAI module connects to llama-server for embeddings:
```cpp
// Endpoint: http://127.0.0.1:8013/embedding
// Model: nomic-embed-text-v1.5 (or similar)
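// Dimension: must match ai_vector_dimension (default 1536).
// Note: nomic-embed-text-v1.5 natively outputs 768-dim vectors, so either
// set ai_vector_dimension to 768 or serve a model that produces 1536 dims.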
```
### Memory Management
```cpp
// GenAI returns malloc'd data - must free after copying
GenAI_EmbeddingResult result = GloGATH->embed_documents({text});
std::vector<float> embedding(result.data, result.data + result.embedding_size);
free(result.data); // Important: free the original data
```
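
One way to make the free-after-copy rule harder to get wrong is to hand the buffer to a `std::unique_ptr` with a `free`-based deleter, so it is released on every exit path. The sketch below is an assumption-laden illustration: `EmbeddingResultSketch` is a hypothetical mirror of the fields listed for `GenAI_EmbeddingResult` (the real struct may differ), and the buffer is assumed to hold `count * embedding_size` contiguous floats.

```cpp
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <vector>

// Hypothetical mirror of the documented fields; not the real ProxySQL type.
struct EmbeddingResultSketch {
    float* data;                 // malloc'd buffer (assumed count * embedding_size floats)
    std::size_t embedding_size;  // floats per embedding
    std::size_t count;           // number of embeddings
};

// Deleter that releases malloc'd memory with free().
struct FreeDeleter {
    void operator()(float* p) const { std::free(p); }
};

// Takes ownership of result.data, copies each embedding into its own vector,
// and frees the original buffer when the function returns or throws.
inline std::vector<std::vector<float>> take_embeddings(EmbeddingResultSketch& result) {
    std::unique_ptr<float, FreeDeleter> owner(result.data);
    result.data = nullptr;  // the caller must not free it again
    std::vector<std::vector<float>> out;
    if (!owner) return out;
    out.reserve(result.count);
    for (std::size_t i = 0; i < result.count; ++i) {
        const float* begin = owner.get() + i * result.embedding_size;
        out.emplace_back(begin, begin + result.embedding_size);
    }
    return out;
}
```

Nulling `result.data` after taking ownership guards against a double free if calling code still follows the manual `free` pattern.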
## Performance
### Embedding Generation
| Operation | Time | Notes |
|-----------|------|-------|
| Generate embedding | ~100-300ms | Via llama-server (local) |
| Vector cache search | ~10-50ms | KNN search with sqlite-vec |
| Pattern similarity check | ~10-50ms | KNN search with sqlite-vec |
### Cache Benefits
- **Cache hit**: ~10-50ms (vs 1-5s for LLM call)
- **Semantic matching**: Higher hit rate than exact text cache
- **Reduced LLM costs**: Fewer API calls to cloud providers
### Storage
- **Embedding size**: 1536 floats × 4 bytes = ~6 KB per query
- **1000 cached queries**: ~6 MB + overhead
- **100 threat patterns**: ~600 KB
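
The storage figures follow from simple arithmetic, which can be checked with a small helper. The function names are illustrative, and the estimate covers raw embedding data only, not SQLite page overhead or the text columns.

```cpp
#include <cstddef>

// Raw size of one embedding stored as 32-bit floats.
inline std::size_t embedding_bytes(std::size_t dimension) {
    return dimension * sizeof(float);
}

// Raw size of `rows` embeddings, excluding any database overhead.
inline std::size_t raw_storage_bytes(std::size_t rows, std::size_t dimension) {
    return rows * embedding_bytes(dimension);
}
```

For a 1536-dimensional model this gives 6144 bytes per embedding; halving the dimension halves the footprint, which is worth considering when choosing an embedding model.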
## Troubleshooting
### Vector Features Not Working
1. **Check AI features enabled**:
```sql
SELECT * FROM runtime_global_variables
WHERE variable_name LIKE 'ai_%_enabled';
```
2. **Check vector DB exists**:
```bash
ls -la /var/lib/proxysql/ai_features.db
```
3. **Check GenAI handler initialized**:
```bash
tail -f proxysql.log | grep GenAI
```
4. **Check llama-server running**:
```bash
curl http://127.0.0.1:8013/embedding
```
### Poor Similarity Detection
1. **Adjust thresholds**:
```sql
-- Lower threshold = more sensitive (more false positives)
SET ai_anomaly_similarity_threshold='80';
```
2. **Add more threat patterns**:
```cpp
anomaly_detector->add_threat_pattern(...);
```
3. **Check embedding quality**:
- Ensure llama-server is using a good embedding model
- Verify query normalization is working
### Cache Issues
```sql
-- Clearing the cache is only available through the C++ API for now (no SQL command yet):
--   anomaly_detector->clear_cache();
-- Check cache statistics
SHOW STATUS LIKE 'ai_nl2sql_cache_%';
```
## Security Considerations
- **Embeddings are stored locally** in SQLite database
- **No external API calls** for similarity search
- **Threat patterns are user-defined** - ensure proper access control
- **Risk scores are heuristic** - tune thresholds for your environment
## Future Enhancements
- [ ] Automatic threat pattern learning from flagged queries
- [ ] Embedding model fine-tuning for SQL domain
- [ ] Distributed vector storage for large-scale deployments
- [ ] Real-time embedding updates for adaptive learning
- [ ] Multi-lingual support for embeddings
## API Reference
See `API.md` for complete API documentation.
## Architecture Details
See `ARCHITECTURE.md` for detailed architecture documentation.
## Testing Guide
See `TESTING.md` for testing instructions.