Vector Features - Embedding-Based Similarity for ProxySQL

Overview

Vector Features provide semantic similarity capabilities for ProxySQL using vector embeddings and sqlite-vec for efficient similarity search. This enables:

  • NL2SQL Vector Cache: Cache natural language queries by semantic meaning, not just exact text
  • Anomaly Detection: Detect SQL threats using embedding similarity against known attack patterns

Features

| Feature | Description | Benefit |
|---------|-------------|---------|
| Semantic Caching | Cache queries by meaning, not exact text | Higher cache hit rates for similar queries |
| Threat Detection | Detect attacks using embedding similarity | Catch variations of known attack patterns |
| Vector Storage | sqlite-vec for efficient KNN search | Fast similarity queries on embedded vectors |
| GenAI Integration | Uses existing GenAI module for embeddings | No external embedding service required |
| Configurable Thresholds | Adjust similarity sensitivity | Balance between false positives and negatives |

Architecture

Query Input
    |
    v
+-----------------+
| GenAI Module    | -> Generate 1536-dim embedding
| (llama-server)  |
+-----------------+
    |
    v
+-----------------+
| Vector DB       | -> Store embedding in SQLite
| (sqlite-vec)    | -> Similarity search via KNN
+-----------------+
    |
    v
+-----------------+
| Result          | -> Similar items within threshold
+-----------------+

Quick Start

1. Enable AI Features

-- Via admin interface
SET ai_features_enabled='true';
LOAD MYSQL VARIABLES TO RUNTIME;

2. Configure Vector Database

-- Set vector DB path (default: /var/lib/proxysql/ai_features.db)
SET ai_vector_db_path='/var/lib/proxysql/ai_features.db';

-- Set vector dimension (default: 1536 for text-embedding-3-small)
SET ai_vector_dimension='1536';

3. Configure NL2SQL Vector Cache

-- Enable NL2SQL
SET ai_nl2sql_enabled='true';

-- Set cache similarity threshold (0-100, default: 85)
SET ai_nl2sql_cache_similarity_threshold='85';

4. Configure Anomaly Detection

-- Enable anomaly detection
SET ai_anomaly_detection_enabled='true';

-- Set similarity threshold (0-100, default: 85)
SET ai_anomaly_similarity_threshold='85';

-- Set risk threshold (0-100, default: 70)
SET ai_anomaly_risk_threshold='70';

NL2SQL Vector Cache

How It Works

  1. User submits NL2SQL query: NL2SQL: Show all customers
  2. Generate embedding: Query text → 1536-dimensional vector
  3. Search cache: Find semantically similar cached queries
  4. Return cached SQL if similarity > threshold
  5. Otherwise call LLM and store result in cache
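
The lookup-then-fallback flow in the steps above can be sketched as follows. This is a minimal illustration, not ProxySQL's implementation: `Nl2SqlCacheFlow`, `CacheHit`, and the four callbacks are hypothetical names standing in for the GenAI embedding call, the sqlite-vec search, the LLM request, and the cache insert.

```cpp
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Hypothetical types for illustration; not ProxySQL's actual classes.
using Embedding = std::vector<float>;

struct CacheHit {
    std::string sql;
    double similarity;  // 0-100, higher means closer
};

struct Nl2SqlCacheFlow {
    // Step 2: turn the question into an embedding.
    std::function<Embedding(const std::string&)> embed;
    // Step 3: return the closest cached entry, if the cache has any.
    std::function<std::optional<CacheHit>(const Embedding&)> nearest;
    // Step 5: ask the LLM, then remember the answer.
    std::function<std::string(const std::string&)> call_llm;
    std::function<void(const std::string&, const Embedding&, const std::string&)> store;
    // Mirrors ai_nl2sql_cache_similarity_threshold.
    double threshold = 85.0;

    // Returns the SQL text; from_cache reports whether the LLM was skipped.
    std::string resolve(const std::string& question, bool& from_cache) const {
        const Embedding e = embed(question);
        if (auto hit = nearest(e); hit && hit->similarity > threshold) {
            from_cache = true;  // Step 4: similar enough, reuse cached SQL.
            return hit->sql;
        }
        from_cache = false;
        std::string sql = call_llm(question);
        store(question, e, sql);
        return sql;
    }
};
```

Passing the collaborators in as callbacks keeps the control flow testable without a running llama-server or vector database. The sketch needs C++17 for `std::optional`.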

Configuration Variables

| Variable | Default | Description |
|----------|---------|-------------|
| ai_nl2sql_enabled | true | Enable/disable NL2SQL |
| ai_nl2sql_cache_similarity_threshold | 85 | Semantic similarity threshold (0-100) |
| ai_nl2sql_timeout_ms | 30000 | LLM request timeout |
| ai_vector_db_path | /var/lib/proxysql/ai_features.db | Vector database file path |
| ai_vector_dimension | 1536 | Embedding dimension |

Example: Semantic Cache Hit

-- First query - calls LLM
NL2SQL: Show me all customers from USA;

-- Similar query - returns cached result (no LLM call!)
NL2SQL: Display customers in the United States;

-- Another similar query - cached
NL2SQL: List USA customers;

All three queries are semantically similar and will hit the cache after the first one.

Cache Statistics

-- View cache statistics
SHOW STATUS LIKE 'ai_nl2sql_cache_%';

Anomaly Detection

How It Works

  1. Query intercepted during session processing
  2. Generate embedding of normalized query
  3. KNN search against threat pattern embeddings
  4. Calculate risk score: (severity / 10) * (1 - distance / 2), a value between 0 and 1 that is compared to the threshold as a percentage
  5. Block or flag if risk > threshold

Configuration Variables

| Variable | Default | Description |
|----------|---------|-------------|
| ai_anomaly_detection_enabled | true | Enable/disable anomaly detection |
| ai_anomaly_similarity_threshold | 85 | Similarity threshold for threat matching (0-100) |
| ai_anomaly_risk_threshold | 70 | Risk score threshold for blocking (0-100) |
| ai_anomaly_rate_limit | 100 | Max anomalies per minute before rate limiting |
| ai_anomaly_auto_block | true | Automatically block high-risk queries |
| ai_anomaly_log_only | false | If true, log but don't block |
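
One plausible reading of how the risk threshold, auto-block, and log-only settings combine is sketched below. The precedence shown here (log-only wins over auto-block) is an assumption, and `Action`, `AnomalyConfig`, and `decide` are hypothetical names; check the source before relying on this behavior.

```cpp
// Possible outcomes for an inspected query (hypothetical enum).
enum class Action { Allow, Flag, Block };

// Mirrors three of the variables above with their documented defaults.
struct AnomalyConfig {
    int risk_threshold = 70;  // ai_anomaly_risk_threshold
    bool auto_block = true;   // ai_anomaly_auto_block
    bool log_only = false;    // ai_anomaly_log_only
};

// risk_percent is the risk score scaled to 0-100.
// Assumed precedence: below threshold -> allow; log-only or auto-block
// disabled -> flag without blocking; otherwise block.
Action decide(const AnomalyConfig& cfg, double risk_percent) {
    if (risk_percent <= cfg.risk_threshold) return Action::Allow;
    if (cfg.log_only || !cfg.auto_block) return Action::Flag;
    return Action::Block;
}
```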

Threat Pattern Management

Add a Threat Pattern

Via C++ API:

anomaly_detector->add_threat_pattern(
    "OR 1=1 Tautology",
    "SELECT * FROM users WHERE username='admin' OR 1=1--'",
    "sql_injection",
    9  // severity 1-10
);

Via MCP (future):

{
  "jsonrpc": "2.0",
  "method": "tools/call",
  "params": {
    "name": "ai_add_threat_pattern",
    "arguments": {
      "pattern_name": "OR 1=1 Tautology",
      "query_example": "SELECT * FROM users WHERE username='admin' OR 1=1--'",
      "pattern_type": "sql_injection",
      "severity": 9
    }
  }
}

List Threat Patterns

std::string patterns = anomaly_detector->list_threat_patterns();
// Returns JSON array of all patterns

Remove a Threat Pattern

bool success = anomaly_detector->remove_threat_pattern(pattern_id);

Built-in Threat Patterns

See scripts/add_threat_patterns.sh for 10 example threat patterns:

| Pattern | Type | Severity |
|---------|------|----------|
| OR 1=1 Tautology | sql_injection | 9 |
| UNION SELECT | sql_injection | 8 |
| Comment Injection | sql_injection | 7 |
| Sleep-based DoS | dos | 6 |
| Benchmark-based DoS | dos | 6 |
| INTO OUTFILE | data_exfiltration | 9 |
| DROP TABLE | privilege_escalation | 10 |
| Schema Probing | reconnaissance | 3 |
| CONCAT Injection | sql_injection | 8 |
| Hex Encoding | sql_injection | 7 |

Detection Example

-- Known threat pattern in database:
-- "SELECT * FROM users WHERE id=1 OR 1=1--"

-- Attacker tries variation:
SELECT * FROM users WHERE id=5 OR 2=2--';

-- Embedding similarity detects this as similar to OR 1=1 pattern
-- Risk score: (9/10) * (1 - 0.15/2) = 0.9 * 0.925 = 0.83 (83% risk)
-- Since 83 > 70 (risk_threshold), query is BLOCKED

Anomaly Statistics

-- View anomaly statistics
SHOW STATUS LIKE 'ai_anomaly_%';
-- ai_detected_anomalies
-- ai_blocked_queries
-- ai_flagged_queries

Via API:

std::string stats = anomaly_detector->get_statistics();
// Returns JSON with detailed statistics

Vector Database

Schema

The vector database (ai_features.db) contains:

Main Tables

nl2sql_cache

CREATE TABLE nl2sql_cache (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    natural_language TEXT NOT NULL,
    generated_sql TEXT NOT NULL,
    schema_context TEXT,
    embedding BLOB,
    hit_count INTEGER DEFAULT 0,
    last_hit INTEGER,
    created_at INTEGER
);

anomaly_patterns

CREATE TABLE anomaly_patterns (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    pattern_name TEXT,
    pattern_type TEXT,  -- e.g. 'sql_injection', 'dos', 'data_exfiltration', 'privilege_escalation', 'reconnaissance'
    query_example TEXT,
    embedding BLOB,
    severity INTEGER,  -- 1-10
    created_at INTEGER
);

query_history

CREATE TABLE query_history (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    query_text TEXT NOT NULL,
    generated_sql TEXT,
    embedding BLOB,
    execution_time_ms INTEGER,
    success BOOLEAN,
    timestamp INTEGER
);

Virtual Vector Tables (sqlite-vec)

CREATE VIRTUAL TABLE nl2sql_cache_vec USING vec0(
    embedding float[1536]
);

CREATE VIRTUAL TABLE anomaly_patterns_vec USING vec0(
    embedding float[1536]
);

CREATE VIRTUAL TABLE query_history_vec USING vec0(
    embedding float[1536]
);

Similarity Search Algorithm

Cosine Distance is used for similarity measurement:

distance = 1 - cosine_similarity

where:
cosine_similarity = (A . B) / (|A| * |B|)

Distance range: 0 (identical) to 2 (opposite)
Similarity = (2 - distance) / 2 * 100
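
As a concrete illustration, the sketch below computes cosine distance under the conventional definition, distance = 1 - cosine similarity, which is the definition that yields the 0-to-2 range stated above. The function names are hypothetical, and sqlite-vec performs this computation internally; the code only shows the arithmetic.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Cosine distance = 1 - (A . B) / (|A| * |B|).
// Returns 0 for vectors pointing the same way and 2 for opposite vectors.
double cosine_distance(const std::vector<float>& a, const std::vector<float>& b) {
    if (a.empty() || a.size() != b.size())
        throw std::invalid_argument("vectors must be non-empty and the same length");
    double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += static_cast<double>(a[i]) * b[i];
        norm_a += static_cast<double>(a[i]) * a[i];
        norm_b += static_cast<double>(b[i]) * b[i];
    }
    if (norm_a == 0.0 || norm_b == 0.0)
        throw std::invalid_argument("a zero vector has no direction");
    return 1.0 - dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
}

// Map a distance in [0, 2] to the 0-100 similarity scale used by the thresholds.
double similarity_percent(double distance) {
    return (2.0 - distance) / 2.0 * 100.0;
}
```

Orthogonal vectors have distance 1, which corresponds to a similarity of 50 on this scale.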

Threshold Conversion:

similarity_threshold (0-100) → distance_threshold (0-2)
distance_threshold = 2.0 - (similarity_threshold / 50.0)

Example:
  similarity = 85 → distance = 2.0 - (85/50.0) = 0.3

KNN Search Example

-- Find similar cached queries
SELECT c.natural_language, c.generated_sql,
       vec_distance_cosine(v.embedding, '[0.1, 0.2, ...]') as distance
FROM nl2sql_cache c
JOIN nl2sql_cache_vec v ON c.id = v.rowid
WHERE v.embedding MATCH '[0.1, 0.2, ...]'
AND distance < 0.3
ORDER BY distance
LIMIT 1;
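
The filter, ordering, and limit in the query above amount to the following in-memory operation. This is only an illustration of the selection logic on precomputed distances; `Candidate` and `nearest_within` are hypothetical names, and sqlite-vec does the real work inside the database.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// A row id paired with its distance to the query embedding (hypothetical type).
struct Candidate {
    long long id;
    double distance;
};

// Keep candidates with distance < max_distance, nearest first, at most k rows.
std::vector<Candidate> nearest_within(std::vector<Candidate> rows,
                                      double max_distance, std::size_t k) {
    // WHERE distance < max_distance
    rows.erase(std::remove_if(rows.begin(), rows.end(),
                              [max_distance](const Candidate& c) {
                                  return !(c.distance < max_distance);
                              }),
               rows.end());
    // ORDER BY distance
    std::sort(rows.begin(), rows.end(),
              [](const Candidate& a, const Candidate& b) {
                  return a.distance < b.distance;
              });
    // LIMIT k
    if (rows.size() > k) rows.resize(k);
    return rows;
}
```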

GenAI Integration

Vector Features use the existing GenAI Module for embedding generation.

Embedding Endpoint

  • Module: lib/GenAI_Thread.cpp
  • Global Handler: GenAI_Threads_Handler *GloGATH
  • Method: embed_documents({text})
  • Returns: GenAI_EmbeddingResult with float* data, embedding_size, count

Configuration

GenAI module connects to llama-server for embeddings:

// Endpoint: http://127.0.0.1:8013/embedding
// Model: nomic-embed-text-v1.5 (or similar)
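// Dimension: must match ai_vector_dimension (1536 by default).
// Note: nomic-embed-text-v1.5 itself produces 768-dimensional vectors.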

Memory Management

// GenAI returns malloc'd data - must free after copying
GenAI_EmbeddingResult result = GloGATH->embed_documents({text});

std::vector<float> embedding(result.data, result.data + result.embedding_size);
free(result.data);  // Important: free the original data
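
To make the free unconditional, the malloc'd buffer can be handed to a `std::unique_ptr` with a custom deleter before copying. The sketch below uses a hypothetical stand-in struct, `EmbeddingResultLike`, that mirrors the fields listed earlier (`data`, `embedding_size`, `count`); the real `GenAI_EmbeddingResult` is defined in ProxySQL's GenAI module and may differ.

```cpp
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <vector>

// Hypothetical stand-in for GenAI_EmbeddingResult; field names follow the
// description above, exact types are assumptions.
struct EmbeddingResultLike {
    float* data;                 // malloc'd buffer of count * embedding_size floats
    std::size_t embedding_size;  // floats per embedding
    std::size_t count;           // number of embeddings in the buffer
};

// Deleter that releases memory obtained from malloc.
struct FreeDeleter {
    void operator()(float* p) const { std::free(p); }
};

// Copies the first embedding into a vector and frees the buffer, even if the
// copy throws. Leaves r.data null so it cannot be freed twice.
std::vector<float> take_first_embedding(EmbeddingResultLike& r) {
    std::unique_ptr<float, FreeDeleter> owner(r.data);
    r.data = nullptr;
    if (!owner || r.count == 0 || r.embedding_size == 0) return {};
    return std::vector<float>(owner.get(), owner.get() + r.embedding_size);
}
```

Compared with a manual `free`, this avoids a leak if the vector allocation throws `std::bad_alloc` between the call and the free.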

Performance

Embedding Generation

| Operation | Time | Notes |
|-----------|------|-------|
| Generate embedding | ~100-300ms | Via llama-server (local) |
| Vector cache search | ~10-50ms | KNN search with sqlite-vec |
| Pattern similarity check | ~10-50ms | KNN search with sqlite-vec |

Cache Benefits

  • Cache hit: ~10-50ms (vs 1-5s for LLM call)
  • Semantic matching: Higher hit rate than exact text cache
  • Reduced LLM costs: Fewer API calls to cloud providers

Storage

  • Embedding size: 1536 floats × 4 bytes = ~6 KB per query
  • 1000 cached queries: ~6 MB + overhead
  • 100 threat patterns: ~600 KB
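
The storage figures above follow from simple arithmetic, which can be reproduced for other dimensions or row counts. A minimal sketch with hypothetical function names; it counts raw embedding bytes only and ignores SQLite and index overhead.

```cpp
#include <cstddef>

// Raw size of one embedding: one 4-byte float per dimension
// (sizeof(float) is 4 on common platforms).
constexpr std::size_t embedding_bytes(std::size_t dimension) {
    return dimension * sizeof(float);
}

// Raw size of `rows` embeddings, excluding text columns and index overhead.
constexpr std::size_t total_embedding_bytes(std::size_t dimension, std::size_t rows) {
    return embedding_bytes(dimension) * rows;
}
```

For the default dimension of 1536 this gives 6,144 bytes per embedding, about 6.1 MB for 1000 cached queries, and about 614 KB for 100 threat patterns.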

Troubleshooting

Vector Features Not Working

  1. Check AI features enabled:

    SELECT * FROM runtime_global_variables
    WHERE variable_name LIKE 'ai_%_enabled';
    
  2. Check vector DB exists:

    ls -la /var/lib/proxysql/ai_features.db
    
  3. Check GenAI handler initialized:

    tail -f proxysql.log | grep GenAI
    
  4. Check llama-server running:

    curl http://127.0.0.1:8013/embedding
    

Poor Similarity Detection

  1. Adjust thresholds:

    -- Lower threshold = more sensitive (more false positives)
    SET ai_anomaly_similarity_threshold='80';
    
  2. Add more threat patterns:

    anomaly_detector->add_threat_pattern(...);
    
  3. Check embedding quality:

    • Ensure llama-server is using a good embedding model
    • Verify query normalization is working

Cache Issues

// Clear cache (C++ API only; no SQL command yet)
anomaly_detector->clear_cache();

-- Check cache statistics
SHOW STATUS LIKE 'ai_nl2sql_cache_%';

Security Considerations

  • Embeddings are stored locally in SQLite database
  • No external API calls for similarity search
  • Threat patterns are user-defined - ensure proper access control
  • Risk scores are heuristic - tune thresholds for your environment

Future Enhancements

  • Automatic threat pattern learning from flagged queries
  • Embedding model fine-tuning for SQL domain
  • Distributed vector storage for large-scale deployments
  • Real-time embedding updates for adaptive learning
  • Multi-lingual support for embeddings

API Reference

See API.md for complete API documentation.

Architecture Details

See ARCHITECTURE.md for detailed architecture documentation.

Testing Guide

See TESTING.md for testing instructions.