
Two-Phase Schema Discovery Redesign - Implementation Summary

Overview

This document summarizes the implementation of the two-phase schema discovery redesign for ProxySQL MCP, which replaces the previous LLM-only auto-discovery with a two-phase architecture:

  1. Phase 1: Static/Auto Discovery - Deterministic harvest from MySQL INFORMATION_SCHEMA
  2. Phase 2: LLM Agent Discovery - Semantic analysis using MCP tools only (NO file I/O)

Implementation Date

January 17, 2026

Files Created

Core Discovery Components

| File | Purpose |
|------|---------|
| include/Discovery_Schema.h | New catalog schema interface with deterministic + LLM layers |
| lib/Discovery_Schema.cpp | Schema initialization with 20+ tables (runs, objects, columns, indexes, fks, profiles, FTS, LLM artifacts) |
| include/Static_Harvester.h | Static harvester interface for deterministic metadata extraction |
| lib/Static_Harvester.cpp | Deterministic metadata harvest from INFORMATION_SCHEMA (mirrors the Python PoC) |

(The refactored Query_Tool_Handler files are listed under Files Modified below.)

Prompt and Orchestration Files

| File | Purpose |
|------|---------|
| scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/prompts/two_phase_discovery_prompt.md | System prompt for the LLM agent (staged discovery, MCP-only I/O) |
| scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/prompts/two_phase_user_prompt.md | User prompt with the discovery procedure |
| scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/two_phase_discovery.py | Orchestration script wrapper for Claude Code |

Files Modified

| File | Changes |
|------|---------|
| include/Query_Tool_Handler.h | COMPLETELY REWRITTEN: now uses Discovery_Schema directly, includes MySQL connection pool |
| lib/Query_Tool_Handler.cpp | COMPLETELY REWRITTEN: 37 tools (20 original + 17 discovery), direct catalog/harvester usage |
| lib/ProxySQL_MCP_Server.cpp | Updated Query_Tool_Handler initialization (new constructor signature), removed Discovery_Tool_Handler |
| include/MCP_Thread.h | Removed Discovery_Tool_Handler forward declaration and pointer |
| lib/Makefile | Added Discovery_Schema.oo and Static_Harvester.oo, removed Discovery_Tool_Handler.oo |

Files Deleted

| File | Reason |
|------|--------|
| include/Discovery_Tool_Handler.h | Consolidated into Query_Tool_Handler |
| lib/Discovery_Tool_Handler.cpp | Consolidated into Query_Tool_Handler |

Architecture

IMPORTANT ARCHITECTURAL NOTE: All discovery tools are now available through the /mcp/query endpoint. The separate /mcp/discovery endpoint approach was removed in favor of consolidation. Query_Tool_Handler now:

  1. Uses Discovery_Schema directly (instead of wrapping MySQL_Tool_Handler)
  2. Includes MySQL connection pool for direct queries
  3. Provides all 37 tools (20 original + 17 discovery) through a single endpoint
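Because everything now sits behind /mcp/query, a quick way to sanity-check the consolidation is the standard MCP tools/list method. The sketch below assumes the endpoint implements tools/list as usual and reuses the host/port from the examples later in this document:

# Enumerate the consolidated tool set (tools/list is the standard MCP
# discovery method; output piped through json.tool for readability)
curl -k -X POST https://localhost:6071/mcp/query \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/list"}' | python3 -m json.tool
# Expect 37 entries: 20 original query tools + 17 discovery tools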

Phase 1: Static Discovery (C++)

The Static_Harvester class performs deterministic metadata extraction:

MySQL INFORMATION_SCHEMA → Static_Harvester → Discovery_Schema SQLite

Harvest stages:

  1. Schemas (information_schema.SCHEMATA)
  2. Objects (information_schema.TABLES, ROUTINES)
  3. Columns (information_schema.COLUMNS) with derived hints (is_time, is_id_like)
  4. Indexes (information_schema.STATISTICS)
  5. Foreign Keys (KEY_COLUMN_USAGE, REFERENTIAL_CONSTRAINTS)
  6. View definitions (information_schema.VIEWS)
  7. Quick profiles (metadata-based analysis)
  8. FTS5 index rebuild

Derived field calculations:

| Field | Calculation |
|-------|-------------|
| is_time | `data_type IN ('date','datetime','timestamp','time','year')` |
| is_id_like | `column_name REGEXP '(^id$ …` |
| has_primary_key | `EXISTS (SELECT 1 FROM indexes WHERE is_primary=1)` |
| has_foreign_keys | `EXISTS (SELECT 1 FROM foreign_keys WHERE child_object_id=?)` |
| has_time_column | `EXISTS (SELECT 1 FROM columns WHERE is_time=1)` |
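For intuition, the is_time rule can be previewed directly against MySQL before any harvest runs. This query is illustrative only (host, credentials, and schema are placeholders); the harvester applies the equivalent logic in C++ during stage 3:

# Preview the is_time classification straight from INFORMATION_SCHEMA
# (placeholders: adjust host, user, and TABLE_SCHEMA for your setup)
mysql -h 127.0.0.1 -u root -p -e "
  SELECT TABLE_NAME, COLUMN_NAME,
         DATA_TYPE IN ('date','datetime','timestamp','time','year') AS is_time
  FROM information_schema.COLUMNS
  WHERE TABLE_SCHEMA = 'sales';"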

Phase 2: LLM Agent Discovery (MCP Tools)

The LLM agent (via Claude Code) performs semantic analysis using 18 MCP tools:

Discovery Trigger (1 tool):

  • discovery.run_static - Triggers ProxySQL's static harvest

Catalog Tools (5 tools):

  • catalog.init - Initialize/migrate SQLite schema
  • catalog.search - FTS5 search over objects
  • catalog.get_object - Get object with columns/indexes/FKs
  • catalog.list_objects - List objects (paged)
  • catalog.get_relationships - Get FKs, view deps, inferred relationships
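All catalog tools are called through the same tools/call envelope as the examples in the Usage section. The following sketch of a catalog.search call is illustrative: only the tool name comes from the implementation, the argument names ("query", "limit") are assumptions:

# Hypothetical catalog.search call; "query" and "limit" are assumed
# argument names, only the tool name is fixed by the implementation
curl -k -X POST https://localhost:6071/mcp/query \
  -H "Content-Type: application/json" \
  -d '{
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
      "name": "catalog.search",
      "arguments": { "query": "invoice", "limit": 10 }
    }
  }'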

Agent Tools (3 tools):

  • agent.run_start - Create agent run bound to run_id
  • agent.run_finish - Mark agent run success/failed
  • agent.event_append - Log tool calls, results, decisions

LLM Memory Tools (9 tools):

  • llm.summary_upsert - Store semantic summary for object
  • llm.summary_get - Get semantic summary
  • llm.relationship_upsert - Store inferred relationship
  • llm.domain_upsert - Create/update domain
  • llm.domain_set_members - Set domain members
  • llm.metric_upsert - Store metric definition
  • llm.question_template_add - Add question template
  • llm.note_add - Add durable note
  • llm.search - FTS over LLM artifacts
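A typical Stage 2 write (see the Discovery Workflow below) persists a per-object summary via llm.summary_upsert. The argument names in this sketch are illustrative assumptions; only the tool name is taken from the implementation:

# Hypothetical llm.summary_upsert call; argument names are assumptions
curl -k -X POST https://localhost:6071/mcp/query \
  -H "Content-Type: application/json" \
  -d '{
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
      "name": "llm.summary_upsert",
      "arguments": {
        "object_id": 42,
        "summary": "Invoice line items; grain = one row per invoice line.",
        "confidence": 0.8
      }
    }
  }'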

Database Schema

Deterministic Layer Tables

| Table | Purpose |
|-------|---------|
| runs | Tracks each discovery run (run_id, started_at, finished_at, source_dsn, mysql_version) |
| schemas | Discovered MySQL schemas (schema_name, charset, collation) |
| objects | Tables/views/routines/triggers with metadata (engine, rows_est, has_pk, has_fks, has_time) |
| columns | Column details (data_type, is_nullable, is_pk, is_unique, is_indexed, is_time, is_id_like) |
| indexes | Index metadata (is_unique, is_primary, index_type, cardinality) |
| index_columns | Ordered index columns |
| foreign_keys | FK relationships |
| foreign_key_columns | Ordered FK columns |
| profiles | Profiling results (JSON for extensibility) |
| fts_objects | FTS5 index over objects (contentless) |
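Because fts_objects is a contentless FTS5 table, matches come back as rowids that join back to the objects table. An illustrative inspection (the SQLite file path is deployment-specific and shown here as a placeholder):

# Placeholder path; substitute the catalog SQLite file for your deployment
sqlite3 /path/to/discovery_catalog.db \
  "SELECT rowid FROM fts_objects WHERE fts_objects MATCH 'invoice';"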

LLM Agent Layer Tables

| Table | Purpose |
|-------|---------|
| agent_runs | LLM agent runs (bound to a deterministic run_id) |
| agent_events | Tool calls, results, decisions (traceability) |
| llm_object_summaries | Per-object semantic summaries (hypothesis, grain, dims/measures, joins) |
| llm_relationships | LLM-inferred relationships with confidence scores |
| llm_domains | Domain clusters (billing, sales, auth, etc.) |
| llm_domain_members | Object-to-domain mapping with roles |
| llm_metrics | Metric/KPI definitions |
| llm_question_templates | NL → structured query plan mappings |
| llm_notes | Free-form durable notes |
| fts_llm | FTS5 index over LLM artifacts |

Usage

The two phases of discovery are invoked separately:

Phase 1: Static Harvest (Direct curl)

Phase 1 is a simple HTTP POST to trigger deterministic metadata extraction. No Claude Code required.

# Option A: Using the convenience script (recommended)
cd scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/
./static_harvest.sh --schema sales --notes "Production sales database discovery"

# Option B: Using curl directly
curl -k -X POST https://localhost:6071/mcp/query \
  -H "Content-Type: application/json" \
  -d '{
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
      "name": "discovery.run_static",
      "arguments": {
        "schema_filter": "sales",
        "notes": "Production sales database discovery"
      }
    }
  }'
# Returns: { run_id: 1, started_at: "...", objects_count: 45, columns_count: 380 }

Phase 2: LLM Agent Discovery (via two_phase_discovery.py)

Phase 2 uses Claude Code for semantic analysis. Requires MCP configuration.

# Step 1: Copy example MCP config and customize
cp scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/mcp_config.example.json mcp_config.json
# Edit mcp_config.json to set your PROXYSQL_MCP_ENDPOINT if needed

# Step 2: Run the two-phase discovery
./scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/two_phase_discovery.py \
    --mcp-config mcp_config.json \
    --schema sales \
    --model claude-3.5-sonnet

# Dry-run mode (preview without executing)
./scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/two_phase_discovery.py \
    --mcp-config mcp_config.json \
    --schema test \
    --dry-run

Direct MCP Tool Calls (via /mcp/query endpoint)

You can also call discovery tools directly via the MCP endpoint:

# Phase 1 (discovery.run_static): identical to the curl example shown above

# Phase 2: LLM agent discovery
curl -k -X POST https://localhost:6071/mcp/query \
  -H "Content-Type: application/json" \
  -d '{
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
      "name": "agent.run_start",
      "arguments": {
        "run_id": 1,
        "model_name": "claude-3.5-sonnet"
      }
    }
  }'
# Returns: { agent_run_id: 1 }

Discovery Workflow

Stage 0: Start and plan
├─> discovery.run_static() → run_id
├─> agent.run_start(run_id) → agent_run_id
└─> agent.event_append(plan, budgets)

Stage 1: Triage and prioritization
└─> catalog.list_objects() + catalog.search() → build prioritized backlog

Stage 2: Per-object semantic summarization
└─> catalog.get_object() + catalog.get_relationships()
    └─> llm.summary_upsert() (50+ high-value objects)

Stage 3: Relationship enhancement
└─> llm.relationship_upsert() (where FKs missing or unclear)

Stage 4: Domain clustering and synthesis
└─> llm.domain_upsert() + llm.domain_set_members()
    └─> llm.note_add(domain descriptions)

Stage 5: "Answerability" artifacts
├─> llm.metric_upsert() (10-30 metrics)
└─> llm.question_template_add() (15-50 question templates)

Shutdown:
├─> agent.event_append(final_summary)
└─> agent.run_finish(success)

Quality Rules

Confidence scores:

  • 0.9–1.0: supported by schema + constraints or very strong evidence
  • 0.6–0.8: likely, supported by multiple signals but not guaranteed
  • 0.3–0.5: tentative hypothesis; flag warnings and what is needed to confirm
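Applied to Stage 3, a mid-band (0.6–0.8) inference might be recorded as follows. Only the llm.relationship_upsert tool name is fixed; the argument names and values are illustrative assumptions:

# Hypothetical llm.relationship_upsert call recording a "likely" inference;
# argument names are assumptions, only the tool name is fixed
curl -k -X POST https://localhost:6071/mcp/query \
  -H "Content-Type: application/json" \
  -d '{
    "jsonrpc": "2.0",
    "id": 3,
    "method": "tools/call",
    "params": {
      "name": "llm.relationship_upsert",
      "arguments": {
        "child": "sales.order_items.order_id",
        "parent": "sales.orders.id",
        "confidence": 0.7,
        "evidence": "column name match + identical data types; no declared FK"
      }
    }
  }'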

Critical Constraint: NO FILES

  • LLM agent MUST NOT create/read/modify any local files
  • All outputs MUST be persisted exclusively via MCP tools
  • Use agent_events and llm_notes as scratchpad

Verification

To verify the implementation:

# Build ProxySQL
cd /home/rene/proxysql-vec
make -j$(nproc)

# Verify new discovery components exist
ls -la include/Discovery_Schema.h include/Static_Harvester.h
ls -la lib/Discovery_Schema.cpp lib/Static_Harvester.cpp

# Verify Discovery_Tool_Handler was removed (should return nothing)
ls include/Discovery_Tool_Handler.h 2>&1 # Should fail
ls lib/Discovery_Tool_Handler.cpp 2>&1   # Should fail

# Verify Query_Tool_Handler uses Discovery_Schema
grep -n "Discovery_Schema" include/Query_Tool_Handler.h
grep -n "Static_Harvester" include/Query_Tool_Handler.h

# Verify Query_Tool_Handler has discovery tools
grep -n "discovery.run_static" lib/Query_Tool_Handler.cpp
grep -n "agent.run_start" lib/Query_Tool_Handler.cpp
grep -n "llm.summary_upsert" lib/Query_Tool_Handler.cpp

# Test Phase 1 (curl)
curl -k -X POST https://localhost:6071/mcp/query \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"discovery.run_static","arguments":{"schema_filter":"test"}}}'
# Should return: { run_id: 1, objects_count: X, columns_count: Y }

# Test Phase 2 (two_phase_discovery.py)
cd scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/
cp mcp_config.example.json mcp_config.json
./two_phase_discovery.py --dry-run --mcp-config mcp_config.json --schema test

Next Steps

  1. Build and test: Compile ProxySQL and test with a small database
  2. Integration testing: Test with medium database (100+ tables)
  3. Documentation updates: Update main README and MCP docs
  4. Migration guide: Document transition from legacy 6-agent to new two-phase system

References

  • Python PoC: /tmp/mysql_autodiscovery_poc.py
  • Schema specification: /tmp/schema.sql
  • MCP tools specification: /tmp/mcp_tools_discovery_catalog.json
  • System prompt reference: /tmp/system_prompt.md
  • User prompt reference: /tmp/user_prompt.md