
Vector Embeddings Implementation Plan (NOT YET IMPLEMENTED)

Overview

This document describes the planned implementation of Vector Embeddings capabilities for the ProxySQL MCP Query endpoint. The Embeddings system will enable AI agents to perform semantic similarity searches on database content using sqlite-vec for vector storage and sqlite-rembed for embedding generation.

Status: PLANNED

Requirements

  1. Embedding Generation: Use sqlite-rembed (placeholder for future GenAI module)
  2. Vector Storage: Use sqlite-vec extension (already compiled into ProxySQL)
  3. Search Type: Semantic similarity search using vector distance
  4. Integration: Work alongside FTS and Catalog for comprehensive search
  5. Use Case: Find semantically similar content, not just keyword matches

Architecture

MCP Query Endpoint (JSON-RPC 2.0 over HTTPS)
    ↓
Query_Tool_Handler (routes tool calls)
    ↓
Discovery_Schema (manages embeddings database)
    ↓
SQLite with sqlite-vec (mcp_catalog.db)
    ↓
LLM_Bridge (embedding generation)
    ↓
External APIs (OpenAI, Ollama, Cohere, etc.)

Database Design

Integrated with Discovery Schema

Path: mcp_catalog.db (uses existing catalog database)

Schema

embedding_indexes (metadata table)

CREATE TABLE IF NOT EXISTS embedding_indexes (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  schema_name TEXT NOT NULL,
  table_name TEXT NOT NULL,
  columns TEXT NOT NULL,              -- JSON array: ["col1", "col2"]
  primary_key TEXT NOT NULL,          -- PK column name for identification
  where_clause TEXT,                  -- Optional WHERE filter
  model_name TEXT NOT NULL,           -- e.g., "text-embedding-3-small"
  vector_dim INTEGER NOT NULL,        -- e.g., 1536 for OpenAI small
  embedding_strategy TEXT NOT NULL,   -- "concat", "average", "separate"
  row_count INTEGER DEFAULT 0,
  indexed_at INTEGER DEFAULT (strftime('%s', 'now')),
  UNIQUE(schema_name, table_name)
);

CREATE INDEX IF NOT EXISTS idx_embedding_indexes_schema ON embedding_indexes(schema_name);
CREATE INDEX IF NOT EXISTS idx_embedding_indexes_table ON embedding_indexes(table_name);
CREATE INDEX IF NOT EXISTS idx_embedding_indexes_model ON embedding_indexes(model_name);
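To make the metadata schema concrete, here is a minimal sketch that creates `embedding_indexes` with the DDL above in a plain SQLite database and registers one hypothetical index (the `testdb.orders` table, column list, and model name are illustrative only, not part of any shipped ProxySQL configuration):

```python
import json
import sqlite3

# In-memory stand-in for mcp_catalog.db; the DDL mirrors the schema above.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE IF NOT EXISTS embedding_indexes (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  schema_name TEXT NOT NULL,
  table_name TEXT NOT NULL,
  columns TEXT NOT NULL,
  primary_key TEXT NOT NULL,
  where_clause TEXT,
  model_name TEXT NOT NULL,
  vector_dim INTEGER NOT NULL,
  embedding_strategy TEXT NOT NULL,
  row_count INTEGER DEFAULT 0,
  indexed_at INTEGER DEFAULT (strftime('%s', 'now')),
  UNIQUE(schema_name, table_name)
);
""")

# Register a hypothetical index; the column and model names are examples.
con.execute(
    "INSERT INTO embedding_indexes "
    "(schema_name, table_name, columns, primary_key, model_name, vector_dim, embedding_strategy) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("testdb", "orders", json.dumps(["customer_name", "notes"]),
     "id", "text-embedding-3-small", 1536, "concat"),
)

row = con.execute(
    "SELECT columns, vector_dim FROM embedding_indexes "
    "WHERE schema_name = 'testdb' AND table_name = 'orders'"
).fetchone()
print(row)  # ('["customer_name", "notes"]', 1536)
```

The `UNIQUE(schema_name, table_name)` constraint means re-indexing the same table must either `DELETE` the old row first or use an upsert.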

Per-Index vec0 Tables (created dynamically)

For each indexed table, create a sqlite-vec virtual table:

-- For OpenAI text-embedding-3-small (1536 dimensions)
CREATE VIRTUAL TABLE embeddings_<sanitized_schema>_<sanitized_table> USING vec0(
  vector float[1536],
  pk_value TEXT,
  metadata TEXT
);

Table Components:

  • vector - The embedding vector (required by vec0)
  • pk_value - Primary key value for MySQL lookup
  • metadata - JSON with original row data
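A KNN query against such a vec0 table ranks stored vectors by distance to the query vector (Euclidean/L2 by default in sqlite-vec, if not configured otherwise). The toy sketch below mimics that ranking in pure Python with 4-dimensional vectors standing in for 1536-dimensional embeddings; the `pk_value` strings are made up for illustration:

```python
import math

# Toy stored vectors keyed by pk_value (illustrative data only).
stored = {
    "order-1": [0.9, 0.1, 0.0, 0.0],
    "order-2": [0.0, 0.8, 0.2, 0.0],
    "order-3": [0.1, 0.1, 0.9, 0.1],
}

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = [1.0, 0.0, 0.0, 0.0]
# In spirit: SELECT pk_value FROM embeddings_... WHERE vector MATCH :query
#            ORDER BY distance LIMIT 2;
ranked = sorted(stored, key=lambda pk: l2(stored[pk], query))[:2]
print(ranked)  # ['order-1', 'order-3']
```

The returned `pk_value`s would then be used to fetch the full rows back from MySQL.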

Sanitization:

  • Replace . and special characters with _
  • Example: `testdb.orders` → `embeddings_testdb_orders`
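A minimal sketch of that sanitization rule, assuming "special characters" means anything outside letters, digits, and underscores (the exact character set is an assumption, not pinned down above):

```python
import re

def sanitize_vec_table_name(schema: str, table: str) -> str:
    """Build the per-index vec0 table name; every character that is not a
    letter, digit, or underscore is replaced with '_'."""
    def clean(s: str) -> str:
        return re.sub(r"[^A-Za-z0-9_]", "_", s)
    return f"embeddings_{clean(schema)}_{clean(table)}"

print(sanitize_vec_table_name("testdb", "orders"))    # embeddings_testdb_orders
print(sanitize_vec_table_name("my-db", "order.log"))  # embeddings_my_db_order_log
```

Note that sanitization can collide (`a.b`/`c` and `a`/`b.c` both map to `embeddings_a_b_c`), so the real implementation may need a collision check against `embedding_indexes`.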

Tools (6 total)

1. embed_index_table

Generate embeddings and create a vector index for a MySQL table.

Parameters:

| Name | Type | Required | Description |
|------|------|----------|-------------|
| schema | string | Yes | Schema name |
| table | string | Yes | Table name |
| columns | string | Yes | JSON array of column names to embed |
| primary_key | string | Yes | Primary key column name |
| where_clause | string | No | Optional WHERE clause for filtering rows |
| model | string | Yes | Embedding model name (e.g., "text-embedding-3-small") |
| strategy | string | No | Embedding strategy: "concat" (default), "average", "separate" |

Embedding Strategies:

| Strategy | Description | When to Use |
|----------|-------------|-------------|
| concat | Concatenate all columns with spaces, generate one embedding | Most common; semantic meaning of combined content |
| average | Generate one embedding per column, average them | Multiple independent columns |
| separate | Store embeddings separately per column | Need column-specific similarity |
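A sketch of how per-row input text could be assembled for each strategy (column names and the `build_content` helper are hypothetical; `None` columns are treated like SQL `COALESCE(col, '')`):

```python
def build_content(row: dict, columns: list, strategy: str = "concat"):
    """Sketch of per-row input construction for the three strategies.

    'concat' returns one string (one embedding per row); 'average' and
    'separate' return one text per column, to be embedded individually.
    For 'average' the resulting vectors would then be element-wise averaged.
    """
    texts = [str(row[c]) if row.get(c) is not None else "" for c in columns]
    if strategy == "concat":
        return " ".join(texts)            # one string -> one embedding
    if strategy in ("average", "separate"):
        return texts                      # one embedding per column
    raise ValueError(f"unknown strategy: {strategy}")

row = {"customer_name": "Acme Corp", "product_name": "Widget", "notes": None}
print(build_content(row, ["customer_name", "product_name", "notes"]))
# 'Acme Corp Widget '
print(build_content(row, ["customer_name", "notes"], "separate"))
# ['Acme Corp', '']
```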

Response:

{
  "success": true,
  "schema": "testdb",
  "table": "orders",
  "model": "text-embedding-3-small",
  "vector_dim": 1536,
  "row_count": 5000,
  "indexed_at": 1736668800
}

Implementation Logic:

  1. Validate parameters (table exists, columns valid)
  2. Check if index already exists
  3. Create vec0 table: embeddings_<sanitized_schema>_<sanitized_table>
  4. Get vector dimension from model (or default to 1536)
  5. Configure sqlite-rembed client (if not already configured)
  6. Fetch all rows from MySQL using execute_query()
  7. For each row:
    • Build content string based on strategy
    • Call rembed() to generate embedding
    • Store vector + metadata in vec0 table
  8. Update embedding_indexes metadata
  9. Return result

Code Example (concat strategy):

-- Configure rembed client
INSERT INTO temp.rembed_clients(name, format, model, key)
VALUES ('mcp_embeddings', 'openai', 'text-embedding-3-small', 'sk-...');

-- Generate and insert embeddings
INSERT INTO embeddings_testdb_orders(rowid, vector, pk_value, metadata)
SELECT
    ROWID,
    rembed('mcp_embeddings',
           COALESCE(customer_name, '') || ' ' ||
           COALESCE(product_name, '') || ' ' ||
           COALESCE(notes, '')) AS vector,
    id AS pk_value,                         -- assumes "id" is the configured primary key
    json_object('customer_name', customer_name,
                'product_name', product_name,
                'notes', notes) AS metadata
FROM orders;                                -- rows assumed staged locally from MySQL

## Implementation Status

### Phase 1: Foundation (PLANNED)

**Step 1: Integrate Embeddings into Discovery_Schema**
- Embeddings functionality to be built into `lib/Discovery_Schema.cpp`
- Will use existing `mcp_catalog.db` database
- Will require new configuration variable `mcp-embeddingpath`

**Step 2: Create Embeddings tables**
- `embedding_indexes` for metadata
- `embedding_data_<schema>_<table>` for vector storage
- Integration with sqlite-vec extension

### Phase 2: Core Indexing (PLANNED)

**Step 3: Implement embedding generation**
- Integration with LLM_Bridge for embedding generation
- Support for multiple embedding models
- Batch processing for performance

### Phase 3: Search Functionality (PLANNED)

**Step 4: Implement search tools**
- `embedding_search` tool in Query_Tool_Handler
- Semantic similarity search with ranking

### Phase 4: Tool Registration (PLANNED)

**Step 5: Register tools**
- Tools to be registered in Query_Tool_Handler::get_tool_list()
- Tools to be routed in Query_Tool_Handler::execute_tool()

## Critical Files (PLANNED)

### Files to Create
- `include/MySQL_Embeddings.h` - Embeddings class header
- `lib/MySQL_Embeddings.cpp` - Embeddings class implementation

### Files to Modify
- `include/Discovery_Schema.h` - Add Embeddings methods
- `lib/Discovery_Schema.cpp` - Implement Embeddings functionality
- `lib/Query_Tool_Handler.cpp` - Add Embeddings tool routing
- `include/Query_Tool_Handler.h` - Add Embeddings tool declarations
- `include/MCP_Thread.h` - Add `mcp_embedding_path` variable
- `lib/MCP_Thread.cpp` - Handle `embedding_path` configuration
- `lib/ProxySQL_MCP_Server.cpp` - Pass `embedding_path` to components
- `Makefile` - Add MySQL_Embeddings.cpp to build

## Future Implementation Details

### Embeddings Integration Pattern

```cpp
class Discovery_Schema {
private:
    // Embeddings methods (PLANNED)
    int create_embedding_tables();
    int generate_embeddings(int run_id);
    json search_embeddings(const std::string& query, const std::string& schema = "", 
                          const std::string& table = "", int limit = 10);
    
public:
    // Embeddings to be maintained during:
    // - Object processing (static harvest)
    // - LLM artifact creation
    // - Catalog rebuild operations
};
```

Agent Workflow Example (PLANNED)

# Agent performs semantic search
semantic_results = call_tool("embedding_search", {
    "query": "find tables related to customer purchases",
    "limit": 10
})

# Agent combines with FTS results
fts_results = call_tool("catalog_search", {
    "query": "customer order"
})

# Agent uses combined results for comprehensive understanding

Future Performance Considerations

  1. Batch Processing: Generate embeddings in batches for performance
  2. Model Selection: Support multiple embedding models with different dimensions
  3. Caching: Cache frequently used embeddings
  4. Indexing: Use ANN (Approximate Nearest Neighbor) for large vector sets
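For point 1, batching is mostly a chunking concern: instead of one embedding API call per row, rows are grouped and sent together. A minimal sketch (the batch size of 64 is an illustrative default, not a ProxySQL setting):

```python
def batches(rows, batch_size=64):
    """Yield fixed-size chunks of rows so embedding API calls can be
    batched instead of issued one row at a time."""
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]

rows = list(range(150))  # stand-in for 150 fetched MySQL rows
sizes = [len(b) for b in batches(rows)]
print(sizes)  # [64, 64, 22]
```

Each yielded chunk would map to a single multi-input embedding request, cutting per-request overhead roughly by the batch size.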

Implementation Prerequisites

  • sqlite-vec extension compiled into ProxySQL
  • sqlite-rembed integration with LLM_Bridge
  • Configuration variable support
  • Tool handler integration

Notes

  • Vector embeddings will complement FTS for comprehensive search
  • Integration with existing catalog for unified search experience
  • Support for multiple embedding models and providers
  • Automatic embedding generation during discovery processes

Version

  • Last Updated: 2026-01-19
  • Status: Planned feature, not yet implemented