# Full Text Search (FTS) Implementation Status
## Overview
This document describes the current implementation of Full Text Search (FTS) capabilities in ProxySQL MCP. The FTS system enables AI agents to quickly search indexed database metadata and LLM-generated artifacts using SQLite's FTS5 extension.
**Status: IMPLEMENTED**
## Requirements
1. **Indexing Strategy**: Optional WHERE clauses, no incremental updates (full rebuild on reindex)
2. **Search Scope**: Agent decides - single table or cross-table search
3. **Storage**: All rows (no limits)
4. **Catalog Integration**: Cross-reference between FTS and catalog - agent can use FTS to get top N IDs, then query real database
5. **Use Case**: FTS as another tool in the agent's toolkit
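
Requirement 1 (full rebuild with an optional WHERE clause) can be sketched as follows. This is a minimal illustration, not the ProxySQL code: the `objects` source table, its `definition` column, and the function name `rebuild_fts_objects` are assumptions made for the example. The `'delete-all'` command is the FTS5 way to empty a contentless index.

```python
import sqlite3

def rebuild_fts_objects(conn, where=None, params=()):
    """Empty fts_objects and re-index rows from a (hypothetical) objects table.

    `where` is an optional SQL condition limiting which rows are indexed.
    Returns the number of rows indexed.
    """
    select_sql = ("SELECT object_id, schema_name, object_name, object_type, definition "
                  "FROM objects")
    if where:
        select_sql += " WHERE " + where
    rows = conn.execute(select_sql, params).fetchall()

    # Full rebuild: drop every index entry, then insert the selected rows again.
    conn.execute("INSERT INTO fts_objects(fts_objects) VALUES('delete-all')")
    conn.executemany(
        "INSERT INTO fts_objects(rowid, schema_name, object_name, object_type, content) "
        "VALUES (?, ?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    return len(rows)
```

Calling `rebuild_fts_objects(conn, "schema_name = ?", ("sales",))` would index only the `sales` schema; calling it with no condition indexes everything.
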
## Architecture
### Components
```
MCP Query Endpoint
    ↓
Query_Tool_Handler (routes tool calls)
    ↓
Discovery_Schema (manages FTS database)
    ↓
SQLite FTS5 (mcp_catalog.db)
```
### Database Design
**Integrated with Discovery Schema**: FTS functionality is built into the existing `mcp_catalog.db` database.
**FTS Tables**:
- `fts_objects` - FTS5 index over database objects (contentless)
- `fts_llm` - FTS5 index over LLM-generated artifacts (with content)
## Tools (Integrated with Discovery Tools)
### 1. catalog_search
Search indexed data using FTS5 across both database objects and LLM artifacts.
**Parameters**:
| Name | Type | Required | Description |
|------|------|----------|-------------|
| query | string | Yes | FTS5 search query |
| include_objects | boolean | No | Include detailed object information (default: false) |
| object_limit | integer | No | Max objects to return when include_objects=true (default: 50) |
**Response**:
```json
{
  "success": true,
  "query": "customer order",
  "results": [
    {
      "kind": "table",
      "key": "sales.orders",
      "schema_name": "sales",
      "object_name": "orders",
      "content": "orders table with columns: order_id, customer_id, order_date, total_amount",
      "rank": 0.5
    }
  ]
}
```
**Implementation Logic**:
1. Search both `fts_objects` and `fts_llm` tables using FTS5
2. Combine results with ranking
3. Optionally fetch detailed object information
4. Return ranked results
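
The four steps above can be sketched in Python against an in-memory copy of the two FTS tables. This is an assumption-laden illustration, not the `Discovery_Schema` implementation: the function name, the `"object"` placeholder kind, and the merge strategy are invented for the example. Because `fts_objects` is contentless, only `rowid` and `rank` can be read from it; real object metadata would have to come from the catalog tables.

```python
import sqlite3

def catalog_search(conn, query, limit=50):
    """Search fts_objects and fts_llm, then merge the hits into one ranked list."""
    results = []
    # Contentless table: column values read back as NULL, so keep only rowid and rank.
    for object_id, rank in conn.execute(
        "SELECT rowid, rank FROM fts_objects WHERE fts_objects MATCH ?", (query,)
    ):
        results.append({"kind": "object", "object_id": object_id, "rank": rank})
    # Regular FTS5 table: the stored text is available directly.
    for kind, key, content, rank in conn.execute(
        "SELECT kind, key, content, rank FROM fts_llm WHERE fts_llm MATCH ?", (query,)
    ):
        results.append({"kind": kind, "key": key, "content": content, "rank": rank})
    # FTS5 assigns numerically lower rank values to better matches.
    results.sort(key=lambda r: r["rank"])
    return {"success": True, "query": query, "results": results[:limit]}
```

One design caveat: bm25 scores are computed per table, so sorting hits from two tables by raw `rank` is only an approximation of a combined relevance order.
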
### 2. llm.search
Search LLM-generated content and insights using FTS5.
**Parameters**:
| Name | Type | Required | Description |
|------|------|----------|-------------|
| query | string | Yes | FTS5 search query |
| type | string | No | Content type to search ("summary", "relationship", "domain", "metric", "note") |
| schema | string | No | Filter by schema |
| limit | integer | No | Maximum results (default: 10) |
**Response**:
```json
{
  "success": true,
  "query": "customer segmentation",
  "results": [
    {
      "kind": "domain",
      "key": "customer_segmentation",
      "content": "Customer segmentation based on purchase behavior and demographics",
      "rank": 0.8
    }
  ]
}
```
**Implementation Logic**:
1. Search `fts_llm` table using FTS5
2. Apply filters if specified
3. Return ranked results with content
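
A minimal sketch of these steps follows, assuming the `type` parameter maps onto the `kind` column of `fts_llm`. The function and parameter names are hypothetical. The `schema` filter is left out because the `fts_llm` definition in this document has no schema column, so how that filter is applied is not known here.

```python
import sqlite3

def llm_search(conn, query, artifact_type="", limit=10):
    """Search fts_llm, optionally restricted to one artifact kind."""
    sql = "SELECT kind, key, content, rank FROM fts_llm WHERE fts_llm MATCH ?"
    params = [query]
    if artifact_type:
        sql += " AND kind = ?"
        params.append(artifact_type)
    sql += " ORDER BY rank LIMIT ?"
    params.append(limit)
    try:
        rows = conn.execute(sql, params).fetchall()
    except sqlite3.Error as exc:
        # A malformed FTS5 query surfaces as an SQLite error.
        return {"success": False, "error": str(exc)}
    fields = ("kind", "key", "content", "rank")
    return {"success": True, "query": query,
            "results": [dict(zip(fields, row)) for row in rows]}
```
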
### 3. catalog_search (Detailed)
This is the same `catalog_search` tool as in section 1, called with `include_objects=true`, so that each matching object additionally carries a `details` block with its catalog metadata.
**Parameters**:
| Name | Type | Required | Description |
|------|------|----------|-------------|
| query | string | Yes | FTS5 search query |
| include_objects | boolean | No | Include detailed object information (default: false) |
| object_limit | integer | No | Max objects to return when include_objects=true (default: 50) |
**Response**:
```json
{
  "success": true,
  "query": "customer order",
  "results": [
    {
      "kind": "table",
      "key": "sales.orders",
      "schema_name": "sales",
      "object_name": "orders",
      "content": "orders table with columns: order_id, customer_id, order_date, total_amount",
      "rank": 0.5,
      "details": {
        "object_id": 123,
        "object_type": "table",
        "schema_name": "sales",
        "object_name": "orders",
        "row_count_estimate": 15000,
        "has_primary_key": true,
        "has_foreign_keys": true,
        "has_time_column": true,
        "columns": [
          {
            "column_name": "order_id",
            "data_type": "int",
            "is_nullable": false,
            "is_primary_key": true
          }
        ]
      }
    }
  ]
}
```
**Implementation Logic**:
1. Search both `fts_objects` and `fts_llm` tables using FTS5
2. Combine results with ranking
3. Optionally fetch detailed object information from `objects`, `columns`, `indexes`, `foreign_keys` tables
4. Return ranked results with detailed information when requested
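
Step 3 can be illustrated with a small lookup helper. The column layouts of `objects` and `columns` used below are assumptions chosen to match the response fields shown above; the real catalog tables may differ, and `indexes` and `foreign_keys` are omitted for brevity.

```python
import sqlite3

def fetch_object_details(conn, object_ids, object_limit=50):
    """Return {object_id: details} for at most object_limit of the given ids."""
    details = {}
    for object_id in object_ids[:object_limit]:
        obj = conn.execute(
            "SELECT object_type, schema_name, object_name "
            "FROM objects WHERE object_id = ?",
            (object_id,),
        ).fetchone()
        if obj is None:
            continue  # FTS hit without a matching catalog row
        columns = [
            {"column_name": name, "data_type": dtype,
             "is_nullable": bool(nullable), "is_primary_key": bool(pk)}
            for name, dtype, nullable, pk in conn.execute(
                "SELECT column_name, data_type, is_nullable, is_primary_key "
                "FROM columns WHERE object_id = ? ORDER BY rowid",
                (object_id,),
            )
        ]
        details[object_id] = {
            "object_id": object_id,
            "object_type": obj[0],
            "schema_name": obj[1],
            "object_name": obj[2],
            "columns": columns,
        }
    return details
```
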
## Database Schema
### fts_objects (contentless FTS5 table)
```sql
CREATE VIRTUAL TABLE fts_objects USING fts5(
    schema_name,
    object_name,
    object_type,
    content,
    content='',
    content_rowid='object_id'
);
```
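
The practical consequence of a contentless table can be shown with a short sketch: a match yields a usable `rowid`, but column values read back as NULL, so the rowid has to be resolved against a catalog table. The `objects` table here is a simplified stand-in, and the `content_rowid` option is left out of the sketch; the rowid is instead set explicitly on insert.

```python
import sqlite3

def contentless_demo():
    """Show what a contentless FTS5 table returns and how to resolve a hit."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE objects(object_id INTEGER PRIMARY KEY, "
                 "schema_name TEXT, object_name TEXT)")
    conn.execute("INSERT INTO objects VALUES (123, 'sales', 'orders')")
    conn.execute("CREATE VIRTUAL TABLE fts_objects USING fts5("
                 "schema_name, object_name, object_type, content, content='')")
    # The rowid is chosen to equal the catalog object_id.
    conn.execute(
        "INSERT INTO fts_objects(rowid, schema_name, object_name, object_type, content) "
        "VALUES (123, 'sales', 'orders', 'table', 'orders table with order totals')"
    )
    # Reading a column other than rowid from a contentless table gives NULL.
    raw = conn.execute(
        "SELECT rowid, object_name FROM fts_objects WHERE fts_objects MATCH 'orders'"
    ).fetchone()
    # Resolve the hit through the catalog table instead.
    resolved = conn.execute(
        "SELECT schema_name, object_name FROM objects WHERE object_id IN "
        "(SELECT rowid FROM fts_objects WHERE fts_objects MATCH 'orders')"
    ).fetchone()
    return raw, resolved
```
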
### fts_llm (FTS5 table with content)
```sql
CREATE VIRTUAL TABLE fts_llm USING fts5(
    kind,
    key,
    content
);
```
## Implementation Status
### Phase 1: Foundation ✅ COMPLETED
**Step 1: Integrate FTS into Discovery_Schema**
- FTS functionality built into `lib/Discovery_Schema.cpp`
- Uses existing `mcp_catalog.db` database
- No separate configuration variable needed
**Step 2: Create FTS tables**
- `fts_objects` for database objects (contentless)
- `fts_llm` for LLM artifacts (with content)
### Phase 2: Core Indexing ✅ COMPLETED
**Step 3: Implement automatic indexing**
- Objects automatically indexed during static harvest
- LLM artifacts automatically indexed during upsert operations
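
Indexing on upsert can be sketched as a delete-then-insert against `fts_llm`. This is one plausible approach and not necessarily how `Discovery_Schema` does it; the function name and the choice of `(kind, key)` as the artifact identity are assumptions.

```python
import sqlite3

def upsert_llm_artifact(conn, kind, key, content):
    """Replace the FTS entry for (kind, key) so the index reflects the latest text."""
    # Remove any previous version of this artifact from the index.
    conn.execute("DELETE FROM fts_llm WHERE kind = ? AND key = ?", (kind, key))
    conn.execute(
        "INSERT INTO fts_llm(kind, key, content) VALUES (?, ?, ?)",
        (kind, key, content),
    )
    conn.commit()
```
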
### Phase 3: Search Functionality ✅ COMPLETED
**Step 4: Implement search tools**
- `catalog_search` tool in Query_Tool_Handler
- `llm.search` tool in Query_Tool_Handler
### Phase 4: Tool Registration ✅ COMPLETED
**Step 5: Register tools**
- Tools registered in Query_Tool_Handler::get_tool_list()
- Tools routed in Query_Tool_Handler::execute_tool()
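
The registration and routing split can be pictured with a small dispatch table. This Python sketch only mirrors the idea of `get_tool_list()` and `execute_tool()`; `make_router` and the handler signatures are hypothetical and do not reflect the C++ interfaces.

```python
def make_router(handlers):
    """Build get_tool_list/execute_tool functions around a name -> callable mapping."""

    def get_tool_list():
        # Registration: the advertised tools are exactly the registered names.
        return sorted(handlers)

    def execute_tool(name, arguments):
        # Routing: look up the handler and report failures in the common error shape.
        handler = handlers.get(name)
        if handler is None:
            return {"success": False, "error": f"Unknown tool: {name}"}
        try:
            return handler(**arguments)
        except TypeError as exc:
            # Raised when a required argument is missing or an unknown one is passed.
            return {"success": False, "error": str(exc)}

    return get_tool_list, execute_tool
```
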
## Critical Files
### Files Modified
- `include/Discovery_Schema.h` - Added FTS methods
- `lib/Discovery_Schema.cpp` - Implemented FTS functionality
- `lib/Query_Tool_Handler.cpp` - Added FTS tool routing
- `include/Query_Tool_Handler.h` - Added FTS tool declarations
## Current Implementation Details
### FTS Integration Pattern
```cpp
class Discovery_Schema {
private:
    // FTS methods
    int create_fts_tables();
    int rebuild_fts_index(int run_id);
    json search_fts(const std::string& query, bool include_objects = false, int object_limit = 50);
    json search_llm_fts(const std::string& query, const std::string& type = "",
                        const std::string& schema = "", int limit = 10);
public:
    // FTS is automatically maintained during:
    // - Object insertion (static harvest)
    // - LLM artifact upsertion
    // - Catalog rebuild operations
};
```
### Error Handling Pattern
```cpp
// Returning an error to the caller
json result;
result["success"] = false;
result["error"] = "Descriptive error message";
return result;

// Logging
proxy_error("FTS error: %s\n", error_msg);
proxy_info("FTS search completed: %zu results\n", result_count);
```
### SQLite Operations Pattern
```cpp
// Write operations (indexing) take the write lock
db->wrlock();
// ... indexing statements ...
db->wrunlock();

// Read operations (search) take the read lock
db->rdlock();
// ... search statements ...
db->rdunlock();

// Prepared statements
sqlite3_stmt* stmt = NULL;
db->prepare_v2(sql, &stmt);
(*proxy_sqlite3_bind_text)(stmt, 1, value.c_str(), -1, SQLITE_TRANSIENT);
SAFE_SQLITE3_STEP2(stmt);
(*proxy_sqlite3_finalize)(stmt);
```
## Agent Workflow Example
```python
# Agent searches for relevant objects
search_results = call_tool("catalog_search", {
    "query": "customer orders with high value",
    "include_objects": True,
    "object_limit": 20
})

# Agent searches for LLM insights
llm_results = call_tool("llm.search", {
    "query": "customer segmentation",
    "type": "domain"
})

# Agent uses results to build understanding
for result in search_results["results"]:
    if result["kind"] == "table":
        # Get detailed table information
        table_details = call_tool("catalog_get_object", {
            "schema": result["schema_name"],
            "object": result["object_name"]
        })
```
## Performance Considerations
1. **Contentless FTS**: `fts_objects` is contentless, so it stores only the full-text index and not a second copy of the indexed text
2. **Automatic Maintenance**: FTS indexes automatically maintained during operations
3. **Ranking**: Results ranked using FTS5 bm25 algorithm
4. **Pagination**: Large result sets automatically paginated
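
As a small illustration of the ranking point, the sketch below orders `fts_llm` hits by the built-in `rank` column, which defaults to bm25. FTS5 gives better matches numerically lower values, so a plain ascending `ORDER BY rank` returns the most relevant rows first. The function name is hypothetical, and the `LIMIT` here is only a stand-in for whatever pagination the implementation applies.

```python
import sqlite3

def search_by_relevance(conn, query, limit=10):
    """Return (key, rank) pairs from fts_llm, best match first."""
    return conn.execute(
        "SELECT key, rank FROM fts_llm WHERE fts_llm MATCH ? "
        "ORDER BY rank LIMIT ?",
        (query, limit),
    ).fetchall()
```
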
## Testing Status ✅ COMPLETED
- [x] Search database objects using FTS
- [x] Search LLM artifacts using FTS
- [x] Combined search with ranking
- [x] Detailed object information retrieval
- [x] Filter by content type
- [x] Filter by schema
- [x] Performance with large catalogs
- [x] Error handling
## Notes
- FTS5 requires an SQLite build with the FTS5 extension enabled (compile-time option `SQLITE_ENABLE_FTS5`)
- Contentless FTS for objects provides fast search without duplicating data
- LLM artifacts stored directly in FTS table for full content search
- Automatic FTS maintenance ensures indexes are always current
- Ranking uses FTS5's built-in bm25 algorithm for relevance scoring
## Version
- **Last Updated:** 2026-01-19
- **Implementation Date:** January 2026
- **Status:** Fully implemented and tested