
Vector Embeddings Implementation Plan (NOT YET IMPLEMENTED)

Overview

This document describes the planned implementation of Vector Embeddings capabilities for the ProxySQL MCP Query endpoint. The Embeddings system will enable AI agents to perform semantic similarity searches on database content using sqlite-vec for vector storage and sqlite-rembed for embedding generation.

Status: PLANNED

Requirements

  1. Embedding Generation: Use sqlite-rembed (placeholder for future GenAI module)
  2. Vector Storage: Use sqlite-vec extension (already compiled into ProxySQL)
  3. Search Type: Semantic similarity search using vector distance
  4. Integration: Work alongside FTS and Catalog for comprehensive search
  5. Use Case: Find semantically similar content, not just keyword matches

Architecture

MCP Query Endpoint (JSON-RPC 2.0 over HTTPS)
    ↓
Query_Tool_Handler (routes tool calls)
    ↓
Discovery_Schema (manages embeddings database)
    ↓
SQLite with sqlite-vec (mcp_catalog.db)
    ↓
LLM_Bridge (embedding generation)
    ↓
External APIs (OpenAI, Ollama, Cohere, etc.)

Database Design

Integrated with Discovery Schema

Path: mcp_catalog.db (uses existing catalog database)

Schema

embedding_indexes (metadata table)

CREATE TABLE IF NOT EXISTS embedding_indexes (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  schema_name TEXT NOT NULL,
  table_name TEXT NOT NULL,
  columns TEXT NOT NULL,              -- JSON array: ["col1", "col2"]
  primary_key TEXT NOT NULL,          -- PK column name for identification
  where_clause TEXT,                  -- Optional WHERE filter
  model_name TEXT NOT NULL,           -- e.g., "text-embedding-3-small"
  vector_dim INTEGER NOT NULL,        -- e.g., 1536 for OpenAI small
  embedding_strategy TEXT NOT NULL,   -- "concat", "average", "separate"
  row_count INTEGER DEFAULT 0,
  indexed_at INTEGER DEFAULT (strftime('%s', 'now')),
  UNIQUE(schema_name, table_name)
);

CREATE INDEX IF NOT EXISTS idx_embedding_indexes_schema ON embedding_indexes(schema_name);
CREATE INDEX IF NOT EXISTS idx_embedding_indexes_table ON embedding_indexes(table_name);
CREATE INDEX IF NOT EXISTS idx_embedding_indexes_model ON embedding_indexes(model_name);
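To make the metadata schema concrete, here is a minimal sketch that creates `embedding_indexes` with the DDL above in a plain SQLite database and registers one hypothetical index (the `testdb.orders` table, column list, and model name are illustrative only, not part of any shipped ProxySQL configuration):

```python
import json
import sqlite3

# In-memory stand-in for mcp_catalog.db; the DDL mirrors the schema above.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE IF NOT EXISTS embedding_indexes (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  schema_name TEXT NOT NULL,
  table_name TEXT NOT NULL,
  columns TEXT NOT NULL,
  primary_key TEXT NOT NULL,
  where_clause TEXT,
  model_name TEXT NOT NULL,
  vector_dim INTEGER NOT NULL,
  embedding_strategy TEXT NOT NULL,
  row_count INTEGER DEFAULT 0,
  indexed_at INTEGER DEFAULT (strftime('%s', 'now')),
  UNIQUE(schema_name, table_name)
);
""")

# Register a hypothetical index; the column and model names are examples.
con.execute(
    "INSERT INTO embedding_indexes "
    "(schema_name, table_name, columns, primary_key, model_name, vector_dim, embedding_strategy) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("testdb", "orders", json.dumps(["customer_name", "notes"]),
     "id", "text-embedding-3-small", 1536, "concat"),
)

row = con.execute(
    "SELECT columns, vector_dim FROM embedding_indexes "
    "WHERE schema_name = 'testdb' AND table_name = 'orders'"
).fetchone()
print(row)  # ('["customer_name", "notes"]', 1536)
```

The `UNIQUE(schema_name, table_name)` constraint means re-indexing the same table must either `DELETE` the old row first or use an upsert.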

Per-Index vec0 Tables (created dynamically)

For each indexed table, create a sqlite-vec virtual table:

-- For OpenAI text-embedding-3-small (1536 dimensions)
CREATE VIRTUAL TABLE embeddings_<sanitized_schema>_<sanitized_table> USING vec0(
  vector float[1536],
  pk_value TEXT,
  metadata TEXT
);

Table Components:

  • vector - The embedding vector (required by vec0)
  • pk_value - Primary key value for MySQL lookup
  • metadata - JSON with original row data
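A KNN query against such a vec0 table ranks stored vectors by distance to the query vector (Euclidean/L2 by default in sqlite-vec, if not configured otherwise). The toy sketch below mimics that ranking in pure Python with 4-dimensional vectors standing in for 1536-dimensional embeddings; the `pk_value` strings are made up for illustration:

```python
import math

# Toy stored vectors keyed by pk_value (illustrative data only).
stored = {
    "order-1": [0.9, 0.1, 0.0, 0.0],
    "order-2": [0.0, 0.8, 0.2, 0.0],
    "order-3": [0.1, 0.1, 0.9, 0.1],
}

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = [1.0, 0.0, 0.0, 0.0]
# In spirit: SELECT pk_value FROM embeddings_... WHERE vector MATCH :query
#            ORDER BY distance LIMIT 2;
ranked = sorted(stored, key=lambda pk: l2(stored[pk], query))[:2]
print(ranked)  # ['order-1', 'order-3']
```

The returned `pk_value`s would then be used to fetch the full rows back from MySQL.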

Sanitization:

  • Replace . and special characters with _
  • Example: `testdb.orders` → `embeddings_testdb_orders`
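A minimal sketch of that sanitization rule, assuming "special characters" means anything outside letters, digits, and underscores (the exact character set is an assumption, not pinned down above):

```python
import re

def sanitize_vec_table_name(schema: str, table: str) -> str:
    """Build the per-index vec0 table name; every character that is not a
    letter, digit, or underscore is replaced with '_'."""
    def clean(s: str) -> str:
        return re.sub(r"[^A-Za-z0-9_]", "_", s)
    return f"embeddings_{clean(schema)}_{clean(table)}"

print(sanitize_vec_table_name("testdb", "orders"))    # embeddings_testdb_orders
print(sanitize_vec_table_name("my-db", "order.log"))  # embeddings_my_db_order_log
```

Note that sanitization can collide (`a.b`/`c` and `a`/`b.c` both map to `embeddings_a_b_c`), so the real implementation may need a collision check against `embedding_indexes`.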

Tools (6 total)

1. embed_index_table

Generate embeddings and create a vector index for a MySQL table.

Parameters:

| Name | Type | Required | Description |
|------|------|----------|-------------|
| schema | string | Yes | Schema name |
| table | string | Yes | Table name |
| columns | string | Yes | JSON array of column names to embed |
| primary_key | string | Yes | Primary key column name |
| where_clause | string | No | Optional WHERE clause for filtering rows |
| model | string | Yes | Embedding model name (e.g., "text-embedding-3-small") |
| strategy | string | No | Embedding strategy: "concat" (default), "average", "separate" |

Embedding Strategies:

| Strategy | Description | When to Use |
|----------|-------------|-------------|
| concat | Concatenate all columns with spaces, generate one embedding | Most common; semantic meaning of combined content |
| average | Generate one embedding per column, average them | Multiple independent columns |
| separate | Store embeddings separately per column | Need column-specific similarity |
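A sketch of how per-row input text could be assembled for each strategy (column names and the `build_content` helper are hypothetical; `None` columns are treated like SQL `COALESCE(col, '')`):

```python
def build_content(row: dict, columns: list, strategy: str = "concat"):
    """Sketch of per-row input construction for the three strategies.

    'concat' returns one string (one embedding per row); 'average' and
    'separate' return one text per column, to be embedded individually.
    For 'average' the resulting vectors would then be element-wise averaged.
    """
    texts = [str(row[c]) if row.get(c) is not None else "" for c in columns]
    if strategy == "concat":
        return " ".join(texts)            # one string -> one embedding
    if strategy in ("average", "separate"):
        return texts                      # one embedding per column
    raise ValueError(f"unknown strategy: {strategy}")

row = {"customer_name": "Acme Corp", "product_name": "Widget", "notes": None}
print(build_content(row, ["customer_name", "product_name", "notes"]))
# 'Acme Corp Widget '
print(build_content(row, ["customer_name", "notes"], "separate"))
# ['Acme Corp', '']
```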

Response:

{
  "success": true,
  "schema": "testdb",
  "table": "orders",
  "model": "text-embedding-3-small",
  "vector_dim": 1536,
  "row_count": 5000,
  "indexed_at": 1736668800
}

Implementation Logic:

  1. Validate parameters (table exists, columns valid)
  2. Check if index already exists
  3. Create vec0 table: embeddings_<sanitized_schema>_<sanitized_table>
  4. Get vector dimension from model (or default to 1536)
  5. Configure sqlite-rembed client (if not already configured)
  6. Fetch all rows from MySQL using execute_query()
  7. For each row:
    • Build content string based on strategy
    • Call rembed() to generate embedding
    • Store vector + metadata in vec0 table
  8. Update embedding_indexes metadata
  9. Return result

Code Example (concat strategy):

-- Configure rembed client
INSERT INTO temp.rembed_clients(name, format, model, key)
VALUES ('mcp_embeddings', 'openai', 'text-embedding-3-small', 'sk-...');

-- Generate and insert embeddings
INSERT INTO embeddings_testdb_orders(rowid, vector, pk_value, metadata)
SELECT
    ROWID,
    rembed('mcp_embeddings',
           COALESCE(customer_name, '') || ' ' ||
           COALESCE(product_name, '') || ' ' ||
           COALESCE(notes, '')) AS vector,
    id AS pk_value,                         -- assumes "id" is the configured primary key
    json_object('customer_name', customer_name,
                'product_name', product_name,
                'notes', notes) AS metadata
FROM orders;                                -- rows assumed staged locally from MySQL

## Implementation Status

### Phase 1: Foundation (PLANNED)

**Step 1: Integrate Embeddings into Discovery_Schema**
- Embeddings functionality to be built into `lib/Discovery_Schema.cpp`
- Will use existing `mcp_catalog.db` database
- Will require new configuration variable `mcp-embeddingpath`

**Step 2: Create Embeddings tables**
- `embedding_indexes` for metadata
- `embedding_data_<schema>_<table>` for vector storage
- Integration with sqlite-vec extension

### Phase 2: Core Indexing (PLANNED)

**Step 3: Implement embedding generation**
- Integration with LLM_Bridge for embedding generation
- Support for multiple embedding models
- Batch processing for performance

### Phase 3: Search Functionality (PLANNED)

**Step 4: Implement search tools**
- `embedding_search` tool in Query_Tool_Handler
- Semantic similarity search with ranking

### Phase 4: Tool Registration (PLANNED)

**Step 5: Register tools**
- Tools to be registered in Query_Tool_Handler::get_tool_list()
- Tools to be routed in Query_Tool_Handler::execute_tool()

## Critical Files (PLANNED)

### Files to Create
- `include/MySQL_Embeddings.h` - Embeddings class header
- `lib/MySQL_Embeddings.cpp` - Embeddings class implementation

### Files to Modify
- `include/Discovery_Schema.h` - Add Embeddings methods
- `lib/Discovery_Schema.cpp` - Implement Embeddings functionality
- `lib/Query_Tool_Handler.cpp` - Add Embeddings tool routing
- `include/Query_Tool_Handler.h` - Add Embeddings tool declarations
- `include/MCP_Thread.h` - Add `mcp_embedding_path` variable
- `lib/MCP_Thread.cpp` - Handle `embedding_path` configuration
- `lib/ProxySQL_MCP_Server.cpp` - Pass `embedding_path` to components
- `Makefile` - Add MySQL_Embeddings.cpp to build

## Future Implementation Details

### Embeddings Integration Pattern

```cpp
class Discovery_Schema {
private:
    // Embeddings methods (PLANNED)
    int create_embedding_tables();
    int generate_embeddings(int run_id);
    json search_embeddings(const std::string& query, const std::string& schema = "", 
                          const std::string& table = "", int limit = 10);
    
public:
    // Embeddings to be maintained during:
    // - Object processing (static harvest)
    // - LLM artifact creation
    // - Catalog rebuild operations
};
```

Agent Workflow Example (PLANNED)

# Agent performs semantic search
semantic_results = call_tool("embedding_search", {
    "query": "find tables related to customer purchases",
    "limit": 10
})

# Agent combines with FTS results
fts_results = call_tool("catalog_search", {
    "query": "customer order"
})

# Agent uses combined results for comprehensive understanding

Future Performance Considerations

  1. Batch Processing: Generate embeddings in batches for performance
  2. Model Selection: Support multiple embedding models with different dimensions
  3. Caching: Cache frequently used embeddings
  4. Indexing: Use ANN (Approximate Nearest Neighbor) for large vector sets
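For point 1, batching is mostly a chunking concern: instead of one embedding API call per row, rows are grouped and sent together. A minimal sketch (the batch size of 64 is an illustrative default, not a ProxySQL setting):

```python
def batches(rows, batch_size=64):
    """Yield fixed-size chunks of rows so embedding API calls can be
    batched instead of issued one row at a time."""
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]

rows = list(range(150))  # stand-in for 150 fetched MySQL rows
sizes = [len(b) for b in batches(rows)]
print(sizes)  # [64, 64, 22]
```

Each yielded chunk would map to a single multi-input embedding request, cutting per-request overhead roughly by the batch size.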

Implementation Prerequisites

  • sqlite-vec extension compiled into ProxySQL
  • sqlite-rembed integration with LLM_Bridge
  • Configuration variable support
  • Tool handler integration

Notes

  • Vector embeddings will complement FTS for comprehensive search
  • Integration with existing catalog for unified search experience
  • Support for multiple embedding models and providers
  • Automatic embedding generation during discovery processes

Version

  • Last Updated: 2026-01-19
  • Status: Planned feature, not yet implemented