From 2ef44e7c3e41dedd6e70e7fb503214be58adac63 Mon Sep 17 00:00:00 2001 From: Rene Cannao Date: Mon, 12 Jan 2026 11:49:26 +0000 Subject: [PATCH] Add MCP implementation plans for FTS and Vector Embeddings Comprehensive implementation documentation for two new search capabilities: FTS (Full Text Search): - 6 tools for lexical search using SQLite FTS5 - Separate mcp_fts.db database - Keyword matching and phrase search - Tools: fts_index_table, fts_search, fts_list_indexes, fts_delete_index, fts_reindex, fts_rebuild_all Vector Embeddings: - 6 tools for semantic search using sqlite-vec - Separate mcp_embeddings.db database - Vector similarity search with sqlite-rembed integration - Placeholder for future GenAI module - Tools: embed_index_table, embed_search, embed_list_indexes, embed_delete_index, embed_reindex, embed_rebuild_all Both systems: - Follow MySQL_Catalog patterns for SQLite management - Integrate with existing MCP Query endpoint - Work alongside Catalog for AI agent memory - 13-step implementation plans with detailed code examples --- doc/MCP/FTS_Implementation_Plan.md | 582 ++++++++++++ .../Vector_Embeddings_Implementation_Plan.md | 884 ++++++++++++++++++ 2 files changed, 1466 insertions(+) create mode 100644 doc/MCP/FTS_Implementation_Plan.md create mode 100644 doc/MCP/Vector_Embeddings_Implementation_Plan.md diff --git a/doc/MCP/FTS_Implementation_Plan.md b/doc/MCP/FTS_Implementation_Plan.md new file mode 100644 index 000000000..4a06d4aae --- /dev/null +++ b/doc/MCP/FTS_Implementation_Plan.md @@ -0,0 +1,582 @@ +# Full Text Search (FTS) Implementation Plan + +## Overview + +This document describes the implementation of Full Text Search (FTS) capabilities for the ProxySQL MCP Query endpoint. The FTS system enables AI agents to quickly search indexed data before querying the full MySQL database, using SQLite's FTS5 extension. + +## Requirements + +1. **Indexing Strategy**: Optional WHERE clauses, no incremental updates (full rebuild on reindex) +2. 
**Search Scope**: Agent decides - single table or cross-table search +3. **Storage**: All rows (no limits) +4. **Catalog Integration**: Cross-reference between FTS and catalog - agent can use FTS to get top N IDs, then query real database +5. **Use Case**: FTS as another tool in the agent's toolkit + +## Architecture + +### Components + +``` +MCP Query Endpoint + ↓ +Query_Tool_Handler (routes tool calls) + ↓ +MySQL_Tool_Handler (implements tools) + ↓ +MySQL_FTS (new class - manages FTS database) + ↓ +SQLite FTS5 (mcp_fts.db) +``` + +### Database Design + +**Separate SQLite database**: `mcp_fts.db` (configurable via `mcp-ftspath` variable) + +**Tables**: +- `fts_indexes` - Metadata for all indexes +- `fts_data_` - Content tables (one per index) +- `fts_search_` - FTS5 virtual tables (one per index) + +## Tools (6 total) + +### 1. fts_index_table + +Create and populate an FTS index for a MySQL table. + +**Parameters**: +| Name | Type | Required | Description | +|------|------|----------|-------------| +| schema | string | Yes | Schema name | +| table | string | Yes | Table name | +| columns | string | Yes | JSON array of column names to index | +| primary_key | string | Yes | Primary key column name | +| where_clause | string | No | Optional WHERE clause for filtering | + +**Response**: +```json +{ + "success": true, + "schema": "sales", + "table": "orders", + "row_count": 15000, + "indexed_at": 1736668800 +} +``` + +**Implementation Logic**: +1. Validate parameters (table exists, columns are valid) +2. Check if index already exists +3. Create dynamic tables: `fts_data__` and `fts_search__
` +4. Fetch all rows from MySQL using `execute_query()` +5. For each row: + - Concatenate indexed column values into searchable content + - Store original row data as JSON metadata + - Insert into data table (triggers sync to FTS) +6. Update `fts_indexes` metadata +7. Return result + +### 2. fts_search + +Search indexed data using FTS5. + +**Parameters**: +| Name | Type | Required | Description | +|------|------|----------|-------------| +| query | string | Yes | FTS5 search query | +| schema | string | No | Filter by schema | +| table | string | No | Filter by table | +| limit | integer | No | Max results (default: 100) | +| offset | integer | No | Pagination offset (default: 0) | + +**Response**: +```json +{ + "success": true, + "query": "urgent order", + "total_matches": 234, + "results": [ + { + "schema": "sales", + "table": "orders", + "primary_key_value": "12345", + "snippet": "Customer has urgentorder...", + "metadata": "{\"order_id\":12345,\"customer_id\":987,...}" + } + ] +} +``` + +**Implementation Logic**: +1. Build FTS5 query with MATCH syntax +2. Apply schema/table filters if specified +3. Execute search with ranking (bm25) +4. Return results with snippets highlighting matches +5. Support pagination + +### 3. fts_list_indexes + +List all FTS indexes with metadata. + +**Parameters**: None + +**Response**: +```json +{ + "success": true, + "indexes": [ + { + "schema": "sales", + "table": "orders", + "columns": ["order_id", "customer_name", "notes"], + "primary_key": "order_id", + "row_count": 15000, + "indexed_at": 1736668800 + } + ] +} +``` + +**Implementation Logic**: +1. Query `fts_indexes` table +2. Return all indexes with metadata + +### 4. fts_delete_index + +Remove an FTS index. 
+ +**Parameters**: +| Name | Type | Required | Description | +|------|------|----------|-------------| +| schema | string | Yes | Schema name | +| table | string | Yes | Table name | + +**Response**: +```json +{ + "success": true, + "schema": "sales", + "table": "orders", + "message": "Index deleted successfully" +} +``` + +**Implementation Logic**: +1. Validate index exists +2. Drop FTS search table +3. Drop data table +4. Remove metadata from `fts_indexes` + +### 5. fts_reindex + +Refresh an index with fresh data (full rebuild). + +**Parameters**: +| Name | Type | Required | Description | +|------|------|----------|-------------| +| schema | string | Yes | Schema name | +| table | string | Yes | Table name | + +**Response**: Same as `fts_index_table` + +**Implementation Logic**: +1. Fetch existing index metadata from `fts_indexes` +2. Delete existing data from tables +3. Call `index_table()` logic with stored metadata +4. Update `indexed_at` timestamp + +### 6. fts_rebuild_all + +Rebuild ALL FTS indexes with fresh data. + +**Parameters**: None + +**Response**: +```json +{ + "success": true, + "rebuilt_count": 5, + "failed": [], + "indexes": [ + { + "schema": "sales", + "table": "orders", + "row_count": 15200, + "status": "success" + } + ] +} +``` + +**Implementation Logic**: +1. Get all indexes from `fts_indexes` table +2. For each index: + - Call `reindex()` with stored metadata + - Track success/failure +3. 
Return summary with rebuilt count and any failures
+
+## Database Schema
+
+### fts_indexes (metadata table)
+```sql
+CREATE TABLE IF NOT EXISTS fts_indexes (
+  id INTEGER PRIMARY KEY AUTOINCREMENT,
+  schema_name TEXT NOT NULL,
+  table_name TEXT NOT NULL,
+  columns TEXT NOT NULL,  -- JSON array of column names
+  primary_key TEXT NOT NULL,
+  where_clause TEXT,
+  row_count INTEGER DEFAULT 0,
+  indexed_at INTEGER DEFAULT (strftime('%s', 'now')),
+  UNIQUE(schema_name, table_name)
+);
+
+CREATE INDEX IF NOT EXISTS idx_fts_indexes_schema ON fts_indexes(schema_name);
+CREATE INDEX IF NOT EXISTS idx_fts_indexes_table ON fts_indexes(table_name);
+```
+
+### Per-Index Tables (created dynamically)
+
+For each indexed table (`{schema}_{table}` below stands for the sanitized schema/table name), create:
+```sql
+-- Data table (stores actual content)
+CREATE TABLE fts_data_{schema}_{table} (
+  rowid INTEGER PRIMARY KEY,
+  schema_name TEXT NOT NULL,
+  table_name TEXT NOT NULL,
+  primary_key_value TEXT NOT NULL,
+  content TEXT NOT NULL,  -- Concatenated searchable text
+  metadata TEXT           -- JSON with original row data
+);
+
+-- FTS5 virtual table (external content)
+CREATE VIRTUAL TABLE fts_search_{schema}_{table} USING fts5(
+  content,
+  metadata,
+  content='fts_data_{schema}_{table}',
+  content_rowid='rowid',
+  tokenize='porter unicode61'
+);
+
+-- Triggers for automatic sync
+CREATE TRIGGER fts_ai_{schema}_{table} AFTER INSERT ON fts_data_{schema}_{table} BEGIN
+  INSERT INTO fts_search_{schema}_{table}(rowid, content, metadata)
+  VALUES (new.rowid, new.content, new.metadata);
+END;
+
+CREATE TRIGGER fts_ad_{schema}_{table} AFTER DELETE ON fts_data_{schema}_{table} BEGIN
+  INSERT INTO fts_search_{schema}_{table}(fts_search_{schema}_{table}, rowid, content, metadata)
+  VALUES ('delete', old.rowid, old.content, old.metadata);
+END;
+
+CREATE TRIGGER fts_au_{schema}_{table} AFTER UPDATE ON fts_data_{schema}_{table} BEGIN
+  INSERT INTO fts_search_{schema}_{table}(fts_search_{schema}_{table}, rowid, content, metadata)
+  VALUES ('delete', old.rowid, old.content, old.metadata);
+  INSERT INTO fts_search_{schema}_{table}(rowid, content, metadata)
+  VALUES (new.rowid, new.content, new.metadata);
+END;
+```
+
+Note: the data table carries `schema_name`, `table_name`, and `primary_key_value` columns so the `fts_search` SQL template can return them directly.
+
+## Implementation Steps
+
+### Phase 1: Foundation
+
+**Step 1: Create MySQL_FTS class**
+- Create `include/MySQL_FTS.h` - Class header with method declarations
+- Create 
`lib/MySQL_FTS.cpp` - Implementation +- Follow `MySQL_Catalog` pattern for SQLite management + +**Step 2: Add configuration variable** +- Modify `include/MCP_Thread.h` - Add `mcp_fts_path` to variables struct +- Modify `lib/MCP_Thread.cpp` - Add to `mcp_thread_variables_names` array +- Handle `fts_path` in get/set variable functions +- Default value: `"mcp_fts.db"` + +**Step 3: Integrate FTS into MySQL_Tool_Handler** +- Add `MySQL_FTS* fts` member to `include/MySQL_Tool_Handler.h` +- Initialize in constructor with `fts_path` +- Clean up in destructor +- Add FTS tool method declarations + +### Phase 2: Core Indexing + +**Step 4: Implement fts_index_table tool** +```cpp +// In MySQL_FTS class +std::string index_table( + const std::string& schema, + const std::string& table, + const std::string& columns, // JSON array + const std::string& primary_key, + const std::string& where_clause, + MySQL_Tool_Handler* mysql_handler +); +``` + +Logic: +- Parse columns JSON array +- Create sanitized table name (replace dots/underscores) +- Create `fts_data_*` and `fts_search_*` tables +- Fetch data: `mysql_handler->execute_query(sql)` +- Build content by concatenating column values +- Insert in batches for performance +- Update metadata + +**Step 5: Implement fts_list_indexes tool** +```cpp +std::string list_indexes(); +``` +Query `fts_indexes` and return JSON array. + +**Step 6: Implement fts_delete_index tool** +```cpp +std::string delete_index(const std::string& schema, const std::string& table); +``` +Drop tables and remove metadata. 
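
The external-content FTS5 pattern behind Steps 4-6 (data table, virtual table, sync trigger, `MATCH` query ranked by `bm25()`) can be exercised end-to-end against SQLite directly. A minimal sketch using Python's built-in `sqlite3` module — table and trigger names are illustrative, and it assumes the underlying SQLite build includes FTS5:

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Data table holding the concatenated searchable text plus JSON metadata,
# mirroring the fts_data_* layout described above (names are illustrative).
cur.execute("""CREATE TABLE fts_data_sales_orders (
    rowid INTEGER PRIMARY KEY,
    content TEXT NOT NULL,
    metadata TEXT)""")

# External-content FTS5 table pointing at the data table.
cur.execute("""CREATE VIRTUAL TABLE fts_search_sales_orders USING fts5(
    content, metadata,
    content='fts_data_sales_orders', content_rowid='rowid',
    tokenize='porter unicode61')""")

# AFTER INSERT trigger keeps the FTS index in sync with the data table.
cur.execute("""CREATE TRIGGER fts_ai AFTER INSERT ON fts_data_sales_orders BEGIN
    INSERT INTO fts_search_sales_orders(rowid, content, metadata)
    VALUES (new.rowid, new.content, new.metadata);
END""")

rows = [
    (1, "urgent order late delivery", json.dumps({"order_id": 1})),
    (2, "routine restock shipment", json.dumps({"order_id": 2})),
]
cur.executemany("INSERT INTO fts_data_sales_orders VALUES (?, ?, ?)", rows)

# MATCH query with a highlighted snippet, best bm25 score first.
hits = cur.execute("""SELECT rowid,
        snippet(fts_search_sales_orders, 0, '<b>', '</b>', '...', 10)
    FROM fts_search_sales_orders
    WHERE fts_search_sales_orders MATCH ?
    ORDER BY bm25(fts_search_sales_orders)""", ("urgent order",)).fetchall()
print(hits)  # only row 1 contains both terms; snippet highlights them
```

The same shape — data table, external-content virtual table, trigger, `MATCH` + `snippet()` + `bm25()` — is what `fts_index_table` and `fts_search` wrap in C++.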
+ +### Phase 3: Search Functionality + +**Step 7: Implement fts_search tool** +```cpp +std::string search( + const std::string& query, + const std::string& schema, + const std::string& table, + int limit, + int offset +); +``` + +SQL query template: +```sql +SELECT + d.schema_name, + d.table_name, + d.primary_key_value, + snippet(fts_search, 2, '', '', '...', 30) as snippet, + d.metadata +FROM fts_search s +JOIN fts_data d ON s.rowid = d.rowid +WHERE fts_search MATCH ? +ORDER BY bm25(fts_search) +LIMIT ? OFFSET ? +``` + +**Step 8: Implement fts_reindex tool** +```cpp +std::string reindex( + const std::string& schema, + const std::string& table, + MySQL_Tool_Handler* mysql_handler +); +``` +Fetch metadata, delete old data, rebuild. + +**Step 9: Implement fts_rebuild_all tool** +```cpp +std::string rebuild_all(MySQL_Tool_Handler* mysql_handler); +``` +Loop through all indexes and rebuild each. + +### Phase 4: Tool Registration + +**Step 10: Register tools in Query_Tool_Handler** +- Modify `lib/Query_Tool_Handler.cpp` +- Add to `get_tool_list()`: + ```cpp + tools.push_back(create_tool_schema( + "fts_index_table", + "Create/populate FTS index for a table", + {"schema", "table", "columns", "primary_key"}, + {{"where_clause", "string"}} + )); + // Repeat for all 6 tools + ``` +- Add routing in `execute_tool()`: + ```cpp + else if (tool_name == "fts_index_table") { + std::string schema = get_json_string(arguments, "schema"); + std::string table = get_json_string(arguments, "table"); + std::string columns = get_json_string(arguments, "columns"); + std::string primary_key = get_json_string(arguments, "primary_key"); + std::string where_clause = get_json_string(arguments, "where_clause"); + result_str = mysql_handler->fts_index_table(schema, table, columns, primary_key, where_clause); + } + // Repeat for other tools + ``` + +**Step 11: Update ProxySQL_MCP_Server** +- Modify `lib/ProxySQL_MCP_Server.cpp` +- Pass `fts_path` when creating MySQL_Tool_Handler +- Initialize FTS: 
`mysql_handler->get_fts()->init()` + +### Phase 5: Build and Test + +**Step 12: Update build system** +- Modify `Makefile` +- Add `lib/MySQL_FTS.cpp` to compilation sources +- Verify link against sqlite3 + +**Step 13: Testing** +- Test all 6 tools via MCP endpoint +- Verify JSON responses +- Test with actual MySQL data +- Test cross-table search +- Test WHERE clause filtering + +## Critical Files + +### New Files to Create +- `include/MySQL_FTS.h` - FTS class header +- `lib/MySQL_FTS.cpp` - FTS class implementation + +### Files to Modify +- `include/MySQL_Tool_Handler.h` - Add FTS member and tool method declarations +- `lib/MySQL_Tool_Handler.cpp` - Add FTS tool wrappers, initialize FTS +- `lib/Query_Tool_Handler.cpp` - Register and route FTS tools +- `include/MCP_Thread.h` - Add `mcp_fts_path` variable +- `lib/MCP_Thread.cpp` - Handle `fts_path` configuration +- `lib/ProxySQL_MCP_Server.cpp` - Pass `fts_path` to MySQL_Tool_Handler +- `Makefile` - Add MySQL_FTS.cpp to build + +## Code Patterns to Follow + +### MySQL_FTS Class Structure (similar to MySQL_Catalog) + +```cpp +class MySQL_FTS { +private: + SQLite3DB* db; + std::string db_path; + + int init_schema(); + int create_tables(); + int create_index_tables(const std::string& schema, const std::string& table); + std::string get_data_table_name(const std::string& schema, const std::string& table); + std::string get_fts_table_name(const std::string& schema, const std::string& table); + +public: + MySQL_FTS(const std::string& path); + ~MySQL_FTS(); + + int init(); + void close(); + + // Tool methods + std::string index_table(...); + std::string search(...); + std::string list_indexes(); + std::string delete_index(...); + std::string reindex(...); + std::string rebuild_all(...); + + bool index_exists(const std::string& schema, const std::string& table); + SQLite3DB* get_db() { return db; } +}; +``` + +### Error Handling Pattern + +```cpp +json result; +result["success"] = false; +result["error"] = "Descriptive error 
message"; +return result.dump(); + +// Logging +proxy_error("FTS error: %s\n", error_msg); +proxy_info("FTS index created: %s.%s\n", schema.c_str(), table.c_str()); +``` + +### SQLite Operations Pattern + +```cpp +db->wrlock(); +// Write operations +db->wrunlock(); + +db->rdlock(); +// Read operations +db->rdunlock(); + +// Prepared statements +sqlite3_stmt* stmt = NULL; +db->prepare_v2(sql, &stmt); +(*proxy_sqlite3_bind_text)(stmt, 1, value.c_str(), -1, SQLITE_TRANSIENT); +SAFE_SQLITE3_STEP2(stmt); +(*proxy_sqlite3_finalize)(stmt); +``` + +### JSON Response Pattern + +```cpp +// Use nlohmann/json +json result; +result["success"] = true; +result["data"] = data_array; +return result.dump(); +``` + +## Configuration Variable + +| Variable | Default | Description | +|----------|---------|-------------| +| `mcp-ftspath` | `mcp_fts.db` | Path to FTS SQLite database (relative or absolute) | + +**Usage**: +```sql +SET mcp-ftspath='/var/lib/proxysql/mcp_fts.db'; +``` + +## Agent Workflow Example + +```python +# Agent narrows down results using FTS +fts_results = call_tool("fts_search", { + "query": "urgent customer complaint", + "limit": 10 +}) + +# Extract primary keys from FTS results +order_ids = [r["primary_key_value"] for r in fts_results["results"]] + +# Query MySQL for full data +full_data = call_tool("run_sql_readonly", { + "sql": f"SELECT * FROM orders WHERE order_id IN ({','.join(order_ids)})" +}) +``` + +## Threading Considerations + +- SQLite3DB provides thread-safe read-write locks +- Use `wrlock()` for writes (index operations) +- Use `rdlock()` for reads (search operations) +- Follow the catalog pattern for locking + +## Performance Considerations + +1. **Batch inserts**: When indexing, insert rows in batches (100-1000 at a time) +2. **Table naming**: Sanitize schema/table names for SQLite table names +3. **Memory usage**: Large tables may require streaming results +4. 
**Index size**: Monitor FTS database size + +## Testing Checklist + +- [ ] Create index on single table +- [ ] Create index with WHERE clause +- [ ] Search single table +- [ ] Search across all tables +- [ ] List indexes +- [ ] Delete index +- [ ] Reindex single table +- [ ] Rebuild all indexes +- [ ] Test with NULL values +- [ ] Test with special characters in data +- [ ] Test pagination +- [ ] Test schema/table filtering + +## Notes + +- Follow existing patterns from `MySQL_Catalog` for SQLite management +- Use SQLite3DB read-write locks for thread safety +- Return JSON responses using nlohmann/json library +- Handle NULL values properly (use empty string as in execute_query) +- Use prepared statements for SQL safety +- Log errors using `proxy_error()` and info using `proxy_info()` +- Table name sanitization: replace `.` and special chars with `_` diff --git a/doc/MCP/Vector_Embeddings_Implementation_Plan.md b/doc/MCP/Vector_Embeddings_Implementation_Plan.md new file mode 100644 index 000000000..0be878068 --- /dev/null +++ b/doc/MCP/Vector_Embeddings_Implementation_Plan.md @@ -0,0 +1,884 @@ +# Vector Embeddings Implementation Plan + +## Overview + +This document describes the implementation of Vector Embeddings capabilities for the ProxySQL MCP Query endpoint. The Embeddings system enables AI agents to perform semantic similarity searches on database content using sqlite-vec for vector storage and sqlite-rembed for embedding generation. + +## Requirements + +1. **Embedding Generation**: Use sqlite-rembed (placeholder for future GenAI module) +2. **Vector Storage**: Use sqlite-vec extension (already compiled into ProxySQL) +3. **Search Type**: Semantic similarity search using vector distance +4. **Integration**: Work alongside FTS and Catalog for comprehensive search +5. 
**Use Case**: Find semantically similar content, not just keyword matches + +## Architecture + +``` +MCP Query Endpoint (JSON-RPC 2.0 over HTTPS) + ↓ +Query_Tool_Handler (routes tool calls) + ↓ +MySQL_Tool_Handler (implements tools) + ↓ +MySQL_Embeddings (new class - manages embeddings database) + ↓ +SQLite with sqlite-vec (mcp_embeddings.db) + ↓ +sqlite-rembed (embedding generation) + ↓ +External APIs (OpenAI, Ollama, Cohere, etc.) +``` + +## Database Design + +### Separate SQLite Database +**Path**: `mcp_embeddings.db` (configurable via `mcp-embeddingpath` variable) + +### Schema + +#### embedding_indexes (metadata table) +```sql +CREATE TABLE IF NOT EXISTS embedding_indexes ( + id INTEGER PRIMARY KEY AUTOINCREMENT, + schema_name TEXT NOT NULL, + table_name TEXT NOT NULL, + columns TEXT NOT NULL, -- JSON array: ["col1", "col2"] + primary_key TEXT NOT NULL, -- PK column name for identification + where_clause TEXT, -- Optional WHERE filter + model_name TEXT NOT NULL, -- e.g., "text-embedding-3-small" + vector_dim INTEGER NOT NULL, -- e.g., 1536 for OpenAI small + embedding_strategy TEXT NOT NULL, -- "concat", "average", "separate" + row_count INTEGER DEFAULT 0, + indexed_at INTEGER DEFAULT (strftime('%s', 'now')), + UNIQUE(schema_name, table_name) +); + +CREATE INDEX IF NOT EXISTS idx_embedding_indexes_schema ON embedding_indexes(schema_name); +CREATE INDEX IF NOT EXISTS idx_embedding_indexes_table ON embedding_indexes(table_name); +CREATE INDEX IF NOT EXISTS idx_embedding_indexes_model ON embedding_indexes(model_name); +``` + +#### Per-Index vec0 Tables (created dynamically) + +For each indexed table, create a sqlite-vec virtual table: + +```sql +-- For OpenAI text-embedding-3-small (1536 dimensions) +CREATE VIRTUAL TABLE embeddings__ USING vec0( + vector float[1536], + pk_value TEXT, + metadata TEXT +); +``` + +**Table Components**: +- `vector` - The embedding vector (required by vec0) +- `pk_value` - Primary key value for MySQL lookup +- `metadata` - JSON with 
original row data + +**Sanitization**: +- Replace `.` and special characters with `_` +- Example: `testdb.orders` → `embeddings_testdb_orders` + +## Tools (6 total) + +### 1. embed_index_table + +Generate embeddings and create a vector index for a MySQL table. + +**Parameters**: +| Name | Type | Required | Description | +|------|------|----------|-------------| +| schema | string | Yes | Schema name | +| table | string | Yes | Table name | +| columns | string | Yes | JSON array of column names to embed | +| primary_key | string | Yes | Primary key column name | +| where_clause | string | No | Optional WHERE clause for filtering rows | +| model | string | Yes | Embedding model name (e.g., "text-embedding-3-small") | +| strategy | string | No | Embedding strategy: "concat" (default), "average", "separate" | + +**Embedding Strategies**: + +| Strategy | Description | When to Use | +|----------|-------------|-------------| +| `concat` | Concatenate all columns with spaces, generate one embedding | Most common, semantic meaning of combined content | +| `average` | Generate embedding per column, average them | Multiple independent columns | +| `separate` | Store embeddings separately per column | Need column-specific similarity | + +**Response**: +```json +{ + "success": true, + "schema": "testdb", + "table": "orders", + "model": "text-embedding-3-small", + "vector_dim": 1536, + "row_count": 5000, + "indexed_at": 1736668800 +} +``` + +**Implementation Logic**: +1. Validate parameters (table exists, columns valid) +2. Check if index already exists +3. Create vec0 table: `embeddings__` +4. Get vector dimension from model (or default to 1536) +5. Configure sqlite-rembed client (if not already configured) +6. Fetch all rows from MySQL using `execute_query()` +7. For each row: + - Build content string based on strategy + - Call `rembed()` to generate embedding + - Store vector + metadata in vec0 table +8. Update `embedding_indexes` metadata +9. 
Return result + +**Code Example (concat strategy)**: +```sql +-- Configure rembed client +INSERT INTO temp.rembed_clients(name, format, model, key) +VALUES ('mcp_embeddings', 'openai', 'text-embedding-3-small', 'sk-...'); + +-- Generate and insert embeddings +INSERT INTO embeddings_testdb_orders(rowid, vector, pk_value, metadata) +SELECT + ROWID, + rembed('mcp_embeddings', + COALESCE(customer_name, '') || ' ' || + COALESCE(product_name, '') || ' ' || + COALESCE(notes, '')) as vector, + CAST(order_id AS TEXT) as pk_value, + json_object( + 'order_id', order_id, + 'customer_name', customer_name, + 'notes', notes + ) as metadata +FROM testdb.orders +WHERE active = 1; +``` + +### 2. embed_search + +Perform semantic similarity search using vector embeddings. + +**Parameters**: +| Name | Type | Required | Description | +|------|------|----------|-------------| +| query | string | Yes | Search query text | +| schema | string | No | Filter by schema | +| table | string | No | Filter by table | +| limit | integer | No | Max results (default: 10) | +| min_distance | float | No | Maximum distance threshold (default: 1.0) | + +**Response**: +```json +{ + "success": true, + "query": "customer complaining about late delivery", + "query_embedding_dim": 1536, + "total_matches": 25, + "results": [ + { + "schema": "testdb", + "table": "orders", + "primary_key_value": "12345", + "distance": 0.234, + "metadata": { + "order_id": 12345, + "customer_name": "John Doe", + "notes": "Customer upset about delivery delay" + } + } + ] +} +``` + +**Implementation Logic**: +1. Generate embedding for query text using `rembed()` +2. Build SQL with vector similarity search +3. Apply schema/table filters if specified +4. Execute KNN search with distance threshold +5. Return ranked results with metadata + +**SQL Query Template**: +```sql +SELECT + e.pk_value as primary_key_value, + e.distance, + e.metadata +FROM embeddings_testdb_orders e +WHERE e.vector MATCH rembed('mcp_embeddings', ?) 
+ AND e.distance < ? +ORDER BY e.distance ASC +LIMIT ?; +``` + +**Distance Metrics** (sqlite-vec supports): +- L2 (Euclidean) - default +- Cosine - for normalized vectors +- Hamming - for binary vectors + +### 3. embed_list_indexes + +List all embedding indexes with metadata. + +**Parameters**: None + +**Response**: +```json +{ + "success": true, + "indexes": [ + { + "schema": "testdb", + "table": "orders", + "columns": ["customer_name", "product_name", "notes"], + "primary_key": "order_id", + "model": "text-embedding-3-small", + "vector_dim": 1536, + "strategy": "concat", + "row_count": 5000, + "indexed_at": 1736668800 + } + ] +} +``` + +**Implementation Logic**: +1. Query `embedding_indexes` table +2. Return all indexes with metadata + +### 4. embed_delete_index + +Remove an embedding index. + +**Parameters**: +| Name | Type | Required | Description | +|------|------|----------|-------------| +| schema | string | Yes | Schema name | +| table | string | Yes | Table name | + +**Response**: +```json +{ + "success": true, + "schema": "testdb", + "table": "orders", + "message": "Embedding index deleted successfully" +} +``` + +**Implementation Logic**: +1. Validate index exists +2. Drop vec0 table +3. Remove metadata from `embedding_indexes` + +### 5. embed_reindex + +Refresh an embedding index with fresh data (full rebuild). + +**Parameters**: +| Name | Type | Required | Description | +|------|------|----------|-------------| +| schema | string | Yes | Schema name | +| table | string | Yes | Table name | + +**Response**: Same as `embed_index_table` + +**Implementation Logic**: +1. Fetch existing index metadata from `embedding_indexes` +2. Drop existing vec0 table +3. Re-create vec0 table +4. Call `embed_index_table` logic with stored metadata +5. Update `indexed_at` timestamp + +### 6. embed_rebuild_all + +Rebuild ALL embedding indexes with fresh data. 
+ +**Parameters**: None + +**Response**: +```json +{ + "success": true, + "rebuilt_count": 3, + "failed": [ + { + "schema": "testdb", + "table": "products", + "error": "API rate limit exceeded" + } + ], + "indexes": [ + { + "schema": "testdb", + "table": "orders", + "row_count": 5100, + "status": "success" + } + ] +} +``` + +**Implementation Logic**: +1. Get all indexes from `embedding_indexes` table +2. For each index: + - Call `reindex()` with stored metadata + - Track success/failure +3. Return summary with rebuilt count and any failures + +## Implementation Steps + +### Phase 1: Foundation + +**Step 1: Create MySQL_Embeddings class** +- Create `include/MySQL_Embeddings.h` - Class header with method declarations +- Create `lib/MySQL_Embeddings.cpp` - Implementation +- Follow `MySQL_FTS` and `MySQL_Catalog` patterns + +**Step 2: Add configuration variable** +- Modify `include/MCP_Thread.h` - Add `mcp_embedding_path` to variables struct +- Modify `lib/MCP_Thread.cpp` - Add to `mcp_thread_variables_names` array +- Handle `embedding_path` in get/set variable functions +- Default value: `"mcp_embeddings.db"` + +**Step 3: Integrate Embeddings into MySQL_Tool_Handler** +- Add `MySQL_Embeddings* embeddings` member to `include/MySQL_Tool_Handler.h` +- Initialize in constructor with `embedding_path` +- Clean up in destructor +- Add Embeddings tool method declarations + +### Phase 2: Core Indexing + +**Step 4: Implement embed_index_table tool** +```cpp +// In MySQL_Embeddings class +std::string index_table( + const std::string& schema, + const std::string& table, + const std::string& columns, // JSON array + const std::string& primary_key, + const std::string& where_clause, + const std::string& model, + const std::string& strategy, + MySQL_Tool_Handler* mysql_handler +); +``` + +Key implementation details: +- Parse columns JSON array +- Create sanitized table name +- Create vec0 table with appropriate dimensions +- Configure sqlite-rembed client if needed +- Fetch data 
from MySQL
+- Generate embeddings using `rembed()` function
+- Insert into vec0 table
+- Update metadata
+
+**GenAI Module Placeholder**:
+```cpp
+// For future GenAI module integration
+// Currently uses sqlite-rembed
+std::vector<float> generate_embedding(
+    const std::string& text,
+    const std::string& model
+) {
+    // PLACEHOLDER: Will call GenAI module when merged
+    // Currently: Use sqlite-rembed
+
+    char* error = NULL;
+    std::string sql = "SELECT rembed('mcp_embeddings', ?) as embedding";
+
+    // Execute query, parse JSON array
+    // Return std::vector<float>
+}
+```
+
+**Step 5: Implement embed_list_indexes tool**
+```cpp
+std::string list_indexes();
+```
+Query `embedding_indexes` and return JSON array.
+
+**Step 6: Implement embed_delete_index tool**
+```cpp
+std::string delete_index(const std::string& schema, const std::string& table);
+```
+Drop vec0 table and remove metadata.
+
+### Phase 3: Search Functionality
+
+**Step 7: Implement embed_search tool**
+```cpp
+std::string search(
+    const std::string& query,
+    const std::string& schema,
+    const std::string& table,
+    int limit,
+    float min_distance
+);
+```
+
+SQL query template (`{schema}_{table}` is the sanitized table-name suffix):
+```sql
+SELECT
+  e.pk_value,
+  e.distance,
+  e.metadata
+FROM embeddings_{schema}_{table} e
+WHERE e.vector MATCH rembed('mcp_embeddings', ?)
+  AND e.distance < ?
+ORDER BY e.distance ASC
+LIMIT ?;
+```
+
+**Step 8: Implement embed_reindex tool**
+```cpp
+std::string reindex(
+    const std::string& schema,
+    const std::string& table,
+    MySQL_Tool_Handler* mysql_handler
+);
+```
+Fetch metadata, rebuild embeddings.
+
+**Step 9: Implement embed_rebuild_all tool**
+```cpp
+std::string rebuild_all(MySQL_Tool_Handler* mysql_handler);
+```
+Loop through all indexes and rebuild each. 
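
Conceptually, the vec0 `MATCH ... ORDER BY distance LIMIT ?` query in Step 7 is a nearest-neighbor ranking by vector distance. A minimal pure-Python stand-in for that ranking — the toy 3-dimensional vectors and primary-key values are illustrative, not real `rembed()` output:

```python
import math

def l2(a, b):
    """L2 (Euclidean) distance, the sqlite-vec default metric."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Stand-in for a vec0 table: pk_value -> stored embedding vector.
index = {
    "12345": [0.9, 0.1, 0.0],  # e.g. "delivery delay complaint"
    "12346": [0.0, 0.2, 0.9],  # e.g. "product works great"
}

# Stand-in for rembed('mcp_embeddings', query_text).
query_vec = [0.8, 0.2, 0.1]

# Brute-force KNN: rank all rows by distance, keep the top `limit`.
limit = 1
results = sorted(((pk, l2(vec, query_vec)) for pk, vec in index.items()),
                 key=lambda r: r[1])[:limit]
print(results)  # nearest row first, as (pk_value, distance) pairs
```

sqlite-vec performs the equivalent scan inside the `MATCH` operator, so the handler only has to bind the query embedding, the distance threshold, and the limit.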
+ +### Phase 4: Tool Registration + +**Step 10: Register tools in Query_Tool_Handler** +- Modify `lib/Query_Tool_Handler.cpp` +- Add to `get_tool_list()`: + ```cpp + tools.push_back(create_tool_schema( + "embed_index_table", + "Generate embeddings and create vector index for a table", + {"schema", "table", "columns", "primary_key", "model"}, + {{"where_clause", "string"}, {"strategy", "string"}} + )); + // Repeat for all 6 tools + ``` +- Add routing in `execute_tool()`: + ```cpp + else if (tool_name == "embed_index_table") { + std::string schema = get_json_string(arguments, "schema"); + std::string table = get_json_string(arguments, "table"); + std::string columns = get_json_string(arguments, "columns"); + std::string primary_key = get_json_string(arguments, "primary_key"); + std::string where_clause = get_json_string(arguments, "where_clause"); + std::string model = get_json_string(arguments, "model"); + std::string strategy = get_json_string(arguments, "strategy", "concat"); + result_str = mysql_handler->embed_index_table(schema, table, columns, primary_key, where_clause, model, strategy); + } + // Repeat for other tools + ``` + +**Step 11: Update ProxySQL_MCP_Server** +- Modify `lib/ProxySQL_MCP_Server.cpp` +- Pass `embedding_path` when creating MySQL_Tool_Handler +- Initialize Embeddings: `mysql_handler->get_embeddings()->init()` + +### Phase 5: Build and Test + +**Step 12: Update build system** +- Modify `Makefile` +- Add `lib/MySQL_Embeddings.cpp` to compilation sources +- Verify link against sqlite3 (already includes vec.o) + +**Step 13: Testing** +- Test all 6 embed tools via MCP endpoint +- Verify JSON responses +- Test with actual MySQL data +- Test cross-table semantic search +- Test different embedding strategies +- Test with sqlite-rembed configured + +## Critical Files + +### New Files to Create +- `include/MySQL_Embeddings.h` - Embeddings class header +- `lib/MySQL_Embeddings.cpp` - Embeddings class implementation + +### Files to Modify +- 
`include/MySQL_Tool_Handler.h` - Add embeddings member and tool method declarations +- `lib/MySQL_Tool_Handler.cpp` - Add embeddings tool wrappers, initialize embeddings +- `lib/Query_Tool_Handler.cpp` - Register and route embeddings tools +- `include/MCP_Thread.h` - Add `mcp_embedding_path` variable +- `lib/MCP_Thread.cpp` - Handle `embedding_path` configuration +- `lib/ProxySQL_MCP_Server.cpp` - Pass `embedding_path` to MySQL_Tool_Handler +- `Makefile` - Add MySQL_Embeddings.cpp to build + +## Code Patterns to Follow + +### MySQL_Embeddings Class Structure + +```cpp +class MySQL_Embeddings { +private: + SQLite3DB* db; + std::string db_path; + + // Schema management + int init_schema(); + int create_tables(); + int create_embedding_table(const std::string& schema, + const std::string& table, + int vector_dim); + std::string get_table_name(const std::string& schema, + const std::string& table); + + // Embedding generation (placeholder for GenAI) + std::vector generate_embedding(const std::string& text, + const std::string& model); + + // Content building strategies + std::string build_content(const json& row, + const std::vector& columns, + const std::string& strategy); + +public: + MySQL_Embeddings(const std::string& path); + ~MySQL_Embeddings(); + + int init(); + void close(); + + // Tool methods + std::string index_table(...); + std::string search(...); + std::string list_indexes(); + std::string delete_index(...); + std::string reindex(...); + std::string rebuild_all(...); + + bool index_exists(const std::string& schema, const std::string& table); + SQLite3DB* get_db() { return db; } +}; +``` + +### sqlite-rembed Configuration + +```cpp +// Configure rembed client during initialization +int MySQL_Embeddings::init() { + // ... open database ... 

    // Check if mcp rembed client exists
    char* error = NULL;
    std::string check_sql = "SELECT name FROM temp.rembed_clients WHERE name='mcp_embeddings'";

    // If not exists, create default client
    // (Requires API key to be configured separately by user)

    return 0;
}
```

### Vector Insert Example

```cpp
// Source rows come from MySQL (e.g. SELECT order_id, customer_name
// FROM testdb.orders WHERE active = 1). For each row, build_content()
// produces `content`, and one embedding is inserted; binding the text
// per row matches rembed()'s single-text input.
std::string sql =
    "INSERT INTO embeddings_testdb_orders(rowid, vector, pk_value, metadata) "
    "VALUES (?1, rembed('mcp_embeddings', ?2), ?3, ?4)";

// Execute with prepared statement
sqlite3_stmt* stmt;
db->prepare_v2(sql.c_str(), &stmt);
(*proxy_sqlite3_bind_int64)(stmt, 1, rowid);
(*proxy_sqlite3_bind_text)(stmt, 2, content.c_str(), -1, SQLITE_TRANSIENT);
(*proxy_sqlite3_bind_text)(stmt, 3, pk_value.c_str(), -1, SQLITE_TRANSIENT);
(*proxy_sqlite3_bind_text)(stmt, 4, metadata_json.c_str(), -1, SQLITE_TRANSIENT);
SAFE_SQLITE3_STEP2(stmt);
(*proxy_sqlite3_finalize)(stmt);
```

### Similarity Search Example

```cpp
// Generate query embedding
std::vector<float> query_vec = generate_embedding(query_text, model_name);
std::string query_vec_json = vector_to_json(query_vec);

// Build search SQL (the query vector is passed as a JSON string literal)
std::ostringstream sql;
sql << "SELECT pk_value, distance, metadata "
    << "FROM embeddings_testdb_orders "
    << "WHERE vector MATCH '" << query_vec_json << "' "
    << "AND distance < " << min_distance << " "
    << "ORDER BY distance ASC "
    << "LIMIT " << limit;

// Execute and return results
```

## Configuration Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `mcp-embeddingpath` | `mcp_embeddings.db` | Path to embeddings SQLite database |
| `mcp-rembed-client` | (none) | Default sqlite-rembed client name (user must configure) |

**sqlite-rembed Configuration** (must be done by user):
```sql
-- Configure OpenAI client
INSERT INTO temp.rembed_clients(name, format, model, key)
VALUES ('mcp_embeddings', 'openai', 'text-embedding-3-small', 'sk-...');

-- Or local 
Ollama
INSERT INTO temp.rembed_clients(name, format, model, key)
VALUES ('mcp_embeddings', 'ollama', 'nomic-embed-text', '');

-- Or Cohere
INSERT INTO temp.rembed_clients(name, format, model, key)
VALUES ('mcp_embeddings', 'cohere', 'embed-english-v3.0', '...');
```

## Model Support

### Common Embedding Models

| Model | Dimensions | Provider | Format |
|-------|------------|----------|--------|
| text-embedding-3-small | 1536 | OpenAI | openai |
| text-embedding-3-large | 3072 | OpenAI | openai |
| nomic-embed-text-v1.5 | 768 | Nomic | nomic |
| all-MiniLM-L6-v2 | 384 | Local (Ollama) | ollama |
| mxbai-embed-large-v1 | 1024 | MixedBread (Ollama) | ollama |

### Vector Dimension Reference

```cpp
// Map model names to dimensions
std::map<std::string, int> model_dimensions = {
    {"text-embedding-3-small", 1536},
    {"text-embedding-3-large", 3072},
    {"nomic-embed-text-v1.5", 768},
    {"all-MiniLM-L6-v2", 384},
    {"mxbai-embed-large-v1", 1024}
};
```

## Agent Workflow Examples

### Example 1: Semantic Search

```python
# Agent finds semantically similar content
embed_results = call_tool("embed_search", {
    "query": "customer unhappy with shipping delay",
    "limit": 10
})

# Extract primary keys
order_ids = [r["primary_key_value"] for r in embed_results["results"]]

# Query MySQL for full data
full_orders = call_tool("run_sql_readonly", {
    "sql": f"SELECT * FROM orders WHERE order_id IN ({','.join(order_ids)})"
})
```

### Example 2: Combined FTS + Embeddings

```python
# FTS for exact keyword match
keyword_results = call_tool("fts_search", {
    "query": "refund request",
    "limit": 50
})

# Embeddings for semantic similarity
semantic_results = call_tool("embed_search", {
    "query": "customer wants money back",
    "limit": 50
})

# Combine and deduplicate for best results
all_ids = set(
    [r["primary_key_value"] for r in keyword_results["results"]] +
    [r["primary_key_value"] for r in semantic_results["results"]]
)
```
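The set union in Example 2 deduplicates but discards each list's ranking. Where ordering matters, the two ranked ID lists can instead be merged with reciprocal rank fusion (RRF); the `rrf_merge` helper and the `k` damping constant below are an illustrative sketch, not part of the planned tool set:

```python
# Merge two ranked result lists with Reciprocal Rank Fusion (RRF).
# Each input is a list of primary-key values ordered best-first.
def rrf_merge(keyword_ids, semantic_ids, k=60):
    scores = {}
    for ranked in (keyword_ids, semantic_ids):
        for rank, pk in enumerate(ranked):
            # 1/(k + rank + 1) weights high ranks more; k damps the tail
            scores[pk] = scores.get(pk, 0.0) + 1.0 / (k + rank + 1)
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)

# "1002" appears in both lists, so it outranks the single-source hits
merged = rrf_merge(["1001", "1002"], ["1002", "1003"])
```

IDs returned by both lexical and semantic search accumulate score from both lists and rise to the top, which is usually the ordering wanted before querying MySQL for the full rows.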
+### Example 3: RAG (Retrieval Augmented Generation) + +```python +# 1. Search for relevant documents +docs = call_tool("embed_search", { + "query": user_question, + "table": "knowledge_base", + "limit": 5 +}) + +# 2. Build context from retrieved documents +context = "\n".join([d["metadata"]["content"] for d in docs["results"]]) + +# 3. Generate answer using context +answer = call_llm({ + "prompt": f"Context: {context}\n\nQuestion: {user_question}\n\nAnswer:" +}) +``` + +## Comparison: FTS vs Embeddings + +| Aspect | FTS (fts_*) | Embeddings (embed_*) | +|--------|-------------|---------------------| +| **Search Type** | Lexical (keyword matching) | Semantic (similarity matching) | +| **Query Example** | "urgent order" | "customer complaint about late delivery" | +| **Technology** | SQLite FTS5 | sqlite-vec | +| **Storage** | Text content | Vector embeddings (float arrays) | +| **External API** | None | sqlite-rembed / GenAI module | +| **Speed** | Very fast | Fast (but API call latency) | +| **Use Cases** | Exact phrase matching, filters | Similar content, semantic understanding | +| **Strengths** | Fast, precise, works offline | Finds related content, handles synonyms | +| **Weaknesses** | Misses semantic matches | Requires API, slower, needs setup | + +## Performance Considerations + +### Embedding Generation +- **API Rate Limits**: OpenAI has rate limits (e.g., 3000 RPM) +- **Batch Processing**: sqlite-rembed doesn't support batching yet +- **Latency**: Each embedding = 1 HTTP call (50-500ms) +- **Cost**: OpenAI charges per token (e.g., $0.00002/1K tokens) + +### Vector Storage +- **Storage**: 1536 floats × 4 bytes = ~6KB per embedding +- **10,000 rows** = ~60MB for embeddings +- **Memory**: sqlite-vec loads vectors into memory for search + +### Search Performance +- **KNN Search**: O(n × d) where n=rows, d=dimensions +- **Typical**: < 100ms for 10K rows, < 1s for 1M rows +- **Limit**: Use LIMIT or `k = ?` constraint (required by vec0) + +## Best Practices + 
### When to Use Embeddings
- **Semantic search**: Find similar meanings, not just keywords
- **Content recommendation**: "Users who liked X also liked Y"
- **Duplicate detection**: Find similar documents
- **Categorization**: Cluster similar content
- **RAG**: Retrieve relevant context for LLM

### When to Use FTS
- **Exact matching**: Log search, code search
- **Filters**: Combined with WHERE clauses
- **Speed critical**: Sub-millisecond response needed
- **Offline**: No external API access

### Column Selection
- **Choose meaningful columns**: Text that captures semantic meaning
- **Avoid IDs/numbers**: Order IDs, timestamps (low semantic value)
- **Combine textually**: `title + description + notes`
- **Preprocess**: Remove HTML, special characters

### Strategy Selection
- **concat**: Default, works for most use cases
- **average**: When columns have independent meaning
- **separate**: When column-specific similarity is needed

## Testing Checklist

### Basic Functionality
- [ ] Create embedding index (single table)
- [ ] Create embedding index with WHERE clause
- [ ] Create embedding index with average strategy
- [ ] Search single table
- [ ] Search across all tables
- [ ] List indexes
- [ ] Delete index
- [ ] Reindex single table
- [ ] Rebuild all indexes

### Edge Cases
- [ ] Empty result sets
- [ ] NULL values in columns
- [ ] Special characters in text
- [ ] Very long text (>10K chars)
- [ ] Non-ASCII text (Unicode)
- [ ] API rate limiting
- [ ] API errors
- [ ] Invalid model names

### Integration
- [ ] Works alongside FTS
- [ ] Works with catalog
- [ ] sqlite-vec extension loaded
- [ ] sqlite-rembed client configured
- [ ] Cross-table semantic search

## GenAI Module Integration (Future)

### Placeholder Interface

```cpp
// When GenAI module is merged, replace sqlite-rembed calls
#ifdef HAVE_GENAI_MODULE
  #include "GenAI_Module.h"
#endif

std::vector<float> MySQL_Embeddings::generate_embedding(
    const 
std::string& text, + const std::string& model +) { +#ifdef HAVE_GENAI_MODULE + // Use GenAI module + return GenAI_Module::generate_embedding(text, model); +#else + // Use sqlite-rembed + std::string sql = "SELECT rembed('mcp_embeddings', ?) as embedding"; + // ... execute and parse ... + return parse_vector_from_json(result); +#endif +} +``` + +### Configuration for GenAI + +When GenAI module is available, add configuration variable: +```sql +SET mcp-genai-provider='local'; -- or 'openai', 'ollama', etc. +SET mcp-genai-model='nomic-embed-text-v1.5'; +``` + +## Troubleshooting + +### Common Issues + +**Issue**: "Error: no such table: temp.rembed_clients" +- **Cause**: sqlite-rembed extension not loaded +- **Fix**: Ensure sqlite-rembed is compiled and auto-registered + +**Issue**: "Error: rembed client not found" +- **Cause**: sqlite-rembed client not configured +- **Fix**: Run INSERT into temp.rembed_clients + +**Issue**: "Error: vector dimension mismatch" +- **Cause**: Model output doesn't match vec0 table dimensions +- **Fix**: Ensure vector_dim matches model output + +**Issue**: API rate limit exceeded +- **Cause**: Too many embedding requests +- **Fix**: Add delays, batch processing (when available), or use local model + +## Notes + +- Follow existing patterns from `MySQL_FTS` and `MySQL_Catalog` for SQLite management +- Use SQLite3DB read-write locks for thread safety +- Return JSON responses using nlohmann/json library +- Handle NULL values properly (use empty string as in execute_query) +- Use prepared statements for SQL safety +- Log errors using `proxy_error()` and info using `proxy_info()` +- Table name sanitization: replace `.` and special chars with `_` +- Always use LIMIT or `k = ?` in vec0 KNN queries (sqlite-vec requirement) +- Configure sqlite-rembed client before indexing +- Consider API costs and rate limits when planning bulk indexing
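
The table-name sanitization note above can be sketched as follows; `sanitize_table_name` is an illustrative helper (not part of the planned class interface) that reads the rule as "every non-alphanumeric character becomes `_`":

```python
import re

# Build the per-index table suffix from schema and table names:
# "testdb" + "orders" -> "testdb_orders"; dots, dashes and other
# special characters all collapse to '_'.
def sanitize_table_name(schema: str, table: str) -> str:
    return re.sub(r"[^A-Za-z0-9]", "_", f"{schema}_{table}")

suffix = sanitize_table_name("testdb", "orders")  # "testdb_orders"
```

The C++ implementation would apply the same mapping before composing names such as `embeddings_testdb_orders`.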