proxysql/doc/VECTOR_FEATURES/README.md

# Vector Features - Embedding-Based Similarity for ProxySQL

## Overview

Vector Features provide **semantic similarity** capabilities for ProxySQL using **vector embeddings** and **sqlite-vec** for efficient similarity search. This enables:

- **NL2SQL Vector Cache**: Cache natural language queries by semantic meaning, not just exact text
- **Anomaly Detection**: Detect SQL threats using embedding similarity against known attack patterns

## Features

| Feature | Description | Benefit |
|---------|-------------|---------|
| **Semantic Caching** | Cache queries by meaning, not exact text | Higher cache hit rates for similar queries |
| **Threat Detection** | Detect attacks using embedding similarity | Catch variations of known attack patterns |
| **Vector Storage** | sqlite-vec for efficient KNN search | Fast similarity queries on embedded vectors |
| **GenAI Integration** | Uses existing GenAI module for embeddings | No cloud embedding API required; embeddings come from the local llama-server |
| **Configurable Thresholds** | Adjust similarity sensitivity | Balance between false positives and negatives |

## Architecture

```
Query Input
    |
    v
+-----------------+
| GenAI Module    | -> Generate 1536-dim embedding
| (llama-server)  |
+-----------------+
    |
    v
+-----------------+
| Vector DB       | -> Store embedding in SQLite
| (sqlite-vec)    | -> Similarity search via KNN
+-----------------+
    |
    v
+-----------------+
| Result          | -> Similar items within threshold
+-----------------+
```

## Quick Start

### 1. Enable AI Features

```sql
-- Via admin interface
SET ai_features_enabled='true';
LOAD MYSQL VARIABLES TO RUNTIME;
```

### 2. Configure Vector Database

```sql
-- Set vector DB path (default: /var/lib/proxysql/ai_features.db)
SET ai_vector_db_path='/var/lib/proxysql/ai_features.db';

-- Set vector dimension (default: 1536 for text-embedding-3-small)
SET ai_vector_dimension='1536';
```

### 3. Configure NL2SQL Vector Cache

```sql
-- Enable NL2SQL
SET ai_nl2sql_enabled='true';

-- Set cache similarity threshold (0-100, default: 85)
SET ai_nl2sql_cache_similarity_threshold='85';
```

### 4. Configure Anomaly Detection

```sql
-- Enable anomaly detection
SET ai_anomaly_detection_enabled='true';

-- Set similarity threshold (0-100, default: 85)
SET ai_anomaly_similarity_threshold='85';

-- Set risk threshold (0-100, default: 70)
SET ai_anomaly_risk_threshold='70';
```

## NL2SQL Vector Cache

### How It Works

1. **User submits NL2SQL query**: `NL2SQL: Show all customers`
2. **Generate embedding**: Query text → 1536-dimensional vector
3. **Search cache**: Find semantically similar cached queries
4. **Return cached SQL** if similarity > threshold
5. **Otherwise call LLM** and store result in cache
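
The lookup flow in steps 2-5 can be sketched roughly as follows. This is a minimal illustration, not the ProxySQL implementation: `CacheHit`, `Nl2SqlDeps`, and `nl2sql_lookup` are hypothetical names, and the embedding, search, LLM, and store operations are injected as callbacks so the control flow stands alone.

```cpp
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Hypothetical types for illustration; not the actual ProxySQL classes.
struct CacheHit {
    std::string sql;     // cached SQL text
    double similarity;   // 0-100, same scale as the threshold variable
};

struct Nl2SqlDeps {
    std::function<std::vector<float>(const std::string&)> embed;
    std::function<std::optional<CacheHit>(const std::vector<float>&)> nearest;
    std::function<std::string(const std::string&)> call_llm;
    std::function<void(const std::string&, const std::vector<float>&,
                       const std::string&)> store;
};

// Embed the question, look for a close neighbour, and fall back to the LLM
// (storing its answer) when nothing clears the threshold.
std::string nl2sql_lookup(const std::string& question, double threshold,
                          const Nl2SqlDeps& deps) {
    const std::vector<float> embedding = deps.embed(question);
    if (auto hit = deps.nearest(embedding); hit && hit->similarity > threshold) {
        return hit->sql;  // cache hit: no LLM call
    }
    std::string sql = deps.call_llm(question);
    deps.store(question, embedding, sql);
    return sql;
}
```

Injecting the four operations keeps the sketch testable without a running llama-server or LLM backend.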

### Configuration Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `ai_nl2sql_enabled` | true | Enable/disable NL2SQL |
| `ai_nl2sql_cache_similarity_threshold` | 85 | Semantic similarity threshold (0-100) |
| `ai_nl2sql_timeout_ms` | 30000 | LLM request timeout |
| `ai_vector_db_path` | /var/lib/proxysql/ai_features.db | Vector database file path |
| `ai_vector_dimension` | 1536 | Embedding dimension |

### Example: Semantic Cache Hit

```sql
-- First query - calls LLM
NL2SQL: Show me all customers from USA;

-- Similar query - returns cached result (no LLM call!)
NL2SQL: Display customers in the United States;

-- Another similar query - cached
NL2SQL: List USA customers;
```

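All three queries are **semantically similar**, so the second and third should hit the cache once the first result is stored, provided their similarity to it exceeds `ai_nl2sql_cache_similarity_threshold`.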

### Cache Statistics

```sql
-- View cache statistics
SHOW STATUS LIKE 'ai_nl2sql_cache_%';
```

## Anomaly Detection

### How It Works

1. **Query intercepted** during session processing
2. **Generate embedding** of normalized query
3. **KNN search** against threat pattern embeddings
4. **Calculate risk score**: `(severity / 10) * (1 - distance / 2)`, a value in 0-1 that is scaled to 0-100 for comparison with `ai_anomaly_risk_threshold`
5. **Block or flag** if risk > threshold

### Configuration Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `ai_anomaly_detection_enabled` | true | Enable/disable anomaly detection |
| `ai_anomaly_similarity_threshold` | 85 | Similarity threshold for threat matching (0-100) |
| `ai_anomaly_risk_threshold` | 70 | Risk score threshold for blocking (0-100) |
| `ai_anomaly_rate_limit` | 100 | Max anomalies per minute before rate limiting |
| `ai_anomaly_auto_block` | true | Automatically block high-risk queries |
| `ai_anomaly_log_only` | false | If true, log but don't block |

### Threat Pattern Management

#### Add a Threat Pattern

Via C++ API:
```cpp
anomaly_detector->add_threat_pattern(
    "OR 1=1 Tautology",
    "SELECT * FROM users WHERE username='admin' OR 1=1--'",
    "sql_injection",
    9  // severity 1-10
);
```

Via MCP (future):
```json
{
  "jsonrpc": "2.0",
  "method": "tools/call",
  "params": {
    "name": "ai_add_threat_pattern",
    "arguments": {
      "pattern_name": "OR 1=1 Tautology",
      "query_example": "SELECT * FROM users WHERE username='admin' OR 1=1--'",
      "pattern_type": "sql_injection",
      "severity": 9
    }
  }
}
```

#### List Threat Patterns

```cpp
std::string patterns = anomaly_detector->list_threat_patterns();
// Returns JSON array of all patterns
```

#### Remove a Threat Pattern

```cpp
bool success = anomaly_detector->remove_threat_pattern(pattern_id);
```

### Built-in Threat Patterns

See `scripts/add_threat_patterns.sh` for 10 example threat patterns:

| Pattern | Type | Severity |
|---------|------|----------|
| OR 1=1 Tautology | sql_injection | 9 |
| UNION SELECT | sql_injection | 8 |
| Comment Injection | sql_injection | 7 |
| Sleep-based DoS | dos | 6 |
| Benchmark-based DoS | dos | 6 |
| INTO OUTFILE | data_exfiltration | 9 |
| DROP TABLE | privilege_escalation | 10 |
| Schema Probing | reconnaissance | 3 |
| CONCAT Injection | sql_injection | 8 |
| Hex Encoding | sql_injection | 7 |

### Detection Example

```sql
-- Known threat pattern in database:
-- "SELECT * FROM users WHERE id=1 OR 1=1--"

-- Attacker tries variation:
SELECT * FROM users WHERE id=5 OR 2=2--';

-- Embedding similarity detects this as similar to OR 1=1 pattern
-- Risk score: (9/10) * (1 - 0.15/2) = 0.83 (83% risk)
-- Since 83 > 70 (risk_threshold), query is BLOCKED
```

### Anomaly Statistics

```sql
-- View anomaly statistics
SHOW STATUS LIKE 'ai_%';
-- Counters of interest:
--   ai_detected_anomalies
--   ai_blocked_queries
--   ai_flagged_queries
```

Via API:
```cpp
std::string stats = anomaly_detector->get_statistics();
// Returns JSON with detailed statistics
```

## Vector Database

### Schema

The vector database (`ai_features.db`) contains:

#### Main Tables

**nl2sql_cache**
```sql
CREATE TABLE nl2sql_cache (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    natural_language TEXT NOT NULL,
    generated_sql TEXT NOT NULL,
    schema_context TEXT,
    embedding BLOB,
    hit_count INTEGER DEFAULT 0,
    last_hit INTEGER,
    created_at INTEGER
);
```

**anomaly_patterns**
```sql
CREATE TABLE anomaly_patterns (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    pattern_name TEXT,
    pattern_type TEXT,  -- e.g. 'sql_injection', 'dos', 'privilege_escalation'
    query_example TEXT,
    embedding BLOB,
    severity INTEGER,  -- 1-10
    created_at INTEGER
);
```

**query_history**
```sql
CREATE TABLE query_history (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    query_text TEXT NOT NULL,
    generated_sql TEXT,
    embedding BLOB,
    execution_time_ms INTEGER,
    success BOOLEAN,
    timestamp INTEGER
);
```

#### Virtual Vector Tables (sqlite-vec)

```sql
CREATE VIRTUAL TABLE nl2sql_cache_vec USING vec0(
    embedding float(1536)
);

CREATE VIRTUAL TABLE anomaly_patterns_vec USING vec0(
    embedding float(1536)
);

CREATE VIRTUAL TABLE query_history_vec USING vec0(
    embedding float(1536)
);
```

### Similarity Search Algorithm

**Cosine Distance** is used for similarity measurement:

```
distance = 1 - cosine_similarity

where:
cosine_similarity = (A . B) / (|A| * |B|)

Distance range: 0 (identical) to 2 (opposite)
Similarity = (2 - distance) / 2 * 100
```
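
For reference, a minimal standalone computation of the cosine distance could look like the sketch below. It assumes the distance is `1 - cosine_similarity`, which is what yields the 0 to 2 range stated above. `cosine_distance` is a hypothetical helper; in the database itself, sqlite-vec performs this calculation.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Cosine distance in [0, 2]: 0 for identical direction, 2 for opposite.
double cosine_distance(const std::vector<float>& a, const std::vector<float>& b) {
    if (a.empty() || a.size() != b.size()) {
        throw std::invalid_argument("vectors must be non-empty and equal in length");
    }
    double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += static_cast<double>(a[i]) * b[i];
        norm_a += static_cast<double>(a[i]) * a[i];
        norm_b += static_cast<double>(b[i]) * b[i];
    }
    if (norm_a == 0.0 || norm_b == 0.0) {
        throw std::invalid_argument("cosine distance is undefined for a zero vector");
    }
    return 1.0 - dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
}
```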

**Threshold Conversion**:
```
similarity_threshold (0-100) → distance_threshold (0-2)
distance_threshold = 2.0 - (similarity_threshold / 50.0)

Example:
  similarity = 85 → distance = 2.0 - (85/50.0) = 0.3
```

### KNN Search Example

```sql
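-- Find the most similar cached query
-- (alias renamed so it does not clash with vec0's hidden `distance` column)
SELECT c.natural_language, c.generated_sql,
       vec_distance_cosine(v.embedding, '[0.1, 0.2, ...]') AS cosine_distance
FROM nl2sql_cache c
JOIN nl2sql_cache_vec v ON c.id = v.rowid
WHERE v.embedding MATCH '[0.1, 0.2, ...]'
  AND k = 10                 -- vec0 KNN needs k = N; LIMIT may not reach it through a JOIN
  AND cosine_distance < 0.3  -- distance cutoff for similarity threshold 85
ORDER BY cosine_distance
LIMIT 1;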
```
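
The `'[0.1, 0.2, ...]'` argument in the query is a JSON array literal, one of the vector input formats sqlite-vec accepts (the other is a raw float32 BLOB). A possible way to build that text from an in-memory embedding is sketched below. `to_vec_json` is a hypothetical helper, not a ProxySQL or sqlite-vec function.

```cpp
#include <cstddef>
#include <iomanip>
#include <limits>
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// Serialize an embedding as a JSON array such as "[0.5,-1,2]".
// Note: NaN or infinity would not produce valid JSON; callers should reject
// such vectors beforehand (not handled in this sketch).
std::string to_vec_json(const std::vector<float>& embedding) {
    std::ostringstream out;
    out.imbue(std::locale::classic());  // always '.' as the decimal separator
    out << std::setprecision(std::numeric_limits<float>::max_digits10);
    out << '[';
    for (std::size_t i = 0; i < embedding.size(); ++i) {
        if (i != 0) out << ',';
        out << embedding[i];
    }
    out << ']';
    return out.str();
}
```

In practice the resulting string would be bound as a statement parameter rather than spliced into the SQL text.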

## GenAI Integration

Vector Features use the existing **GenAI Module** for embedding generation.

### Embedding Endpoint

- **Module**: `lib/GenAI_Thread.cpp`
- **Global Handler**: `GenAI_Threads_Handler *GloGATH`
- **Method**: `embed_documents({text})`
- **Returns**: `GenAI_EmbeddingResult` with `float* data`, `embedding_size`, `count`

### Configuration

GenAI module connects to llama-server for embeddings:

```cpp
// Endpoint: http://127.0.0.1:8013/embedding
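// Model: nomic-embed-text-v1.5 (or similar)
// Dimension: must match ai_vector_dimension (default 1536).
// Note: nomic-embed-text-v1.5 itself emits 768-dimensional vectors.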
```

### Memory Management

```cpp
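// GenAI returns malloc'd data - must free after copying
GenAI_EmbeddingResult result = GloGATH->embed_documents({text});

std::vector<float> embedding;
if (result.data != nullptr && result.count > 0) {
    // Copy the first embedding out of the malloc'd buffer
    embedding.assign(result.data, result.data + result.embedding_size);
}
free(result.data);  // Important: free the original data (free(nullptr) is a no-op)
result.data = nullptr;  // Avoid a dangling pointer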
```
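
One way to make the copy-then-free pattern harder to get wrong is to hand the buffer to a `std::unique_ptr` with a `free`-based deleter, so it is released on every return path. The sketch below uses a hypothetical `EmbeddingBuffer` struct that only mirrors the fields listed above. It is not the real `GenAI_EmbeddingResult`, whose ownership rules should be checked in the GenAI module before adopting this approach.

```cpp
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <vector>

// Hypothetical stand-in with the same fields as the documented result type.
struct EmbeddingBuffer {
    float* data;                 // malloc'd by the producer
    std::size_t embedding_size;  // floats per embedding
    std::size_t count;           // number of embeddings
};

struct FreeDeleter {
    void operator()(float* p) const noexcept { std::free(p); }
};

// Takes ownership of buf.data, copies the first embedding, and frees the
// buffer on every path, including the early return.
std::vector<float> take_first_embedding(EmbeddingBuffer& buf) {
    std::unique_ptr<float, FreeDeleter> owner(buf.data);
    buf.data = nullptr;  // the caller must not free it again
    if (!owner || buf.count == 0 || buf.embedding_size == 0) {
        return {};
    }
    return std::vector<float>(owner.get(), owner.get() + buf.embedding_size);
}
```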

## Performance

### Embedding Generation

| Operation | Time | Notes |
|-----------|------|-------|
| Generate embedding | ~100-300ms | Via llama-server (local) |
| Vector cache search | ~10-50ms | KNN search with sqlite-vec |
| Pattern similarity check | ~10-50ms | KNN search with sqlite-vec |

### Cache Benefits

- **Cache hit**: ~10-50ms for the vector search, plus ~100-300ms to embed the incoming query (vs 1-5s for LLM call)
- **Semantic matching**: Higher hit rate than exact text cache
- **Reduced LLM costs**: Fewer API calls to cloud providers

### Storage

- **Embedding size**: 1536 floats × 4 bytes = ~6 KB per query
- **1000 cached queries**: ~6 MB + overhead
- **100 threat patterns**: ~600 KB
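
The storage figures follow directly from the embedding dimension. A small sketch of the arithmetic, with hypothetical helper names and counting only raw float32 vector data (no SQLite or index overhead):

```cpp
#include <cstdint>

// Bytes needed for one float32 embedding of the given dimension.
constexpr std::uint64_t embedding_bytes(std::uint64_t dimension) {
    return dimension * sizeof(float);
}

// Raw vector bytes for `rows` stored embeddings.
constexpr std::uint64_t raw_vector_bytes(std::uint64_t rows, std::uint64_t dimension) {
    return rows * embedding_bytes(dimension);
}

static_assert(embedding_bytes(1536) == 6144, "1536 floats x 4 bytes");
```

With the default dimension of 1536, 1000 rows come to 6,144,000 bytes and 100 rows to 614,400 bytes. Halving the dimension halves both.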

## Troubleshooting

### Vector Features Not Working

1. **Check AI features enabled**:
   ```sql
   SELECT * FROM runtime_global_variables
   WHERE variable_name LIKE 'ai_%_enabled';
   ```

2. **Check vector DB exists**:
   ```bash
   ls -la /var/lib/proxysql/ai_features.db
   ```

3. **Check GenAI handler initialized**:
   ```bash
   tail -f proxysql.log | grep GenAI
   ```

4. **Check llama-server running**:
   ```bash
   curl -s -X POST http://127.0.0.1:8013/embedding \
        -H 'Content-Type: application/json' -d '{"content": "SELECT 1"}'
   ```

### Poor Similarity Detection

1. **Adjust thresholds**:
   ```sql
   -- Lower threshold = more sensitive (more false positives)
   SET ai_anomaly_similarity_threshold='80';
   ```

2. **Add more threat patterns**:
   ```cpp
   anomaly_detector->add_threat_pattern(...);
   ```

3. **Check embedding quality**:
   - Ensure llama-server is using a good embedding model
   - Verify query normalization is working

### Cache Issues

```sql
-- Clear cache: no SQL command yet; use the C++ API call
-- anomaly_detector->clear_cache();

-- Check cache statistics
SHOW STATUS LIKE 'ai_nl2sql_cache_%';
```

## Security Considerations

- **Embeddings are stored locally** in SQLite database
- **No external API calls** for similarity search
- **Threat patterns are user-defined** - ensure proper access control
- **Risk scores are heuristic** - tune thresholds for your environment

## Future Enhancements

- [ ] Automatic threat pattern learning from flagged queries
- [ ] Embedding model fine-tuning for SQL domain
- [ ] Distributed vector storage for large-scale deployments
- [ ] Real-time embedding updates for adaptive learning
- [ ] Multi-lingual support for embeddings

## API Reference

See `API.md` for complete API documentation.

## Architecture Details

See `ARCHITECTURE.md` for detailed architecture documentation.

## Testing Guide

See `TESTING.md` for testing instructions.