
Two-Phase Schema Discovery Redesign - Implementation Summary

Overview

This document summarizes the implementation of the two-phase schema discovery redesign for ProxySQL MCP, which replaces the previous LLM-only auto-discovery with a two-phase architecture:

  1. Phase 1: Static/Auto Discovery - Deterministic harvest from MySQL INFORMATION_SCHEMA
  2. Phase 2: LLM Agent Discovery - Semantic analysis using MCP tools only (NO file I/O)

Implementation Date

January 17, 2026

Files Created

Core Discovery Components

| File | Purpose |
|------|---------|
| include/Discovery_Schema.h | New catalog schema interface with deterministic + LLM layers |
| lib/Discovery_Schema.cpp | Schema initialization with 20+ tables (runs, objects, columns, indexes, fks, profiles, FTS, LLM artifacts) |
| include/Static_Harvester.h | Static harvester interface for deterministic metadata extraction |
| lib/Static_Harvester.cpp | Deterministic metadata harvest from INFORMATION_SCHEMA (mirrors the Python PoC) |

(The refactored Query_Tool_Handler files are listed under Files Modified below.)

Prompt and Orchestration Files

| File | Purpose |
|------|---------|
| scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/prompts/two_phase_discovery_prompt.md | System prompt for the LLM agent (staged discovery, MCP-only I/O) |
| scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/prompts/two_phase_user_prompt.md | User prompt with the discovery procedure |
| scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/two_phase_discovery.py | Orchestration script wrapper for Claude Code |

Files Modified

| File | Changes |
|------|---------|
| include/Query_Tool_Handler.h | COMPLETELY REWRITTEN: now uses Discovery_Schema directly, includes MySQL connection pool |
| lib/Query_Tool_Handler.cpp | COMPLETELY REWRITTEN: 37 tools (20 original + 17 discovery), direct catalog/harvester usage |
| lib/ProxySQL_MCP_Server.cpp | Updated Query_Tool_Handler initialization (new constructor signature), removed Discovery_Tool_Handler |
| include/MCP_Thread.h | Removed Discovery_Tool_Handler forward declaration and pointer |
| lib/Makefile | Added Discovery_Schema.oo and Static_Harvester.oo, removed Discovery_Tool_Handler.oo |

Files Deleted

| File | Reason |
|------|--------|
| include/Discovery_Tool_Handler.h | Consolidated into Query_Tool_Handler |
| lib/Discovery_Tool_Handler.cpp | Consolidated into Query_Tool_Handler |

Architecture

IMPORTANT ARCHITECTURAL NOTE: All discovery tools are now available through the /mcp/query endpoint. The separate /mcp/discovery endpoint approach was removed in favor of consolidation. Query_Tool_Handler now:

  1. Uses Discovery_Schema directly (instead of wrapping MySQL_Tool_Handler)
  2. Includes MySQL connection pool for direct queries
  3. Provides all 37 tools (20 original + 17 discovery) through a single endpoint
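Because everything now sits behind /mcp/query, a quick way to sanity-check the consolidation is the standard MCP tools/list method. The sketch below assumes the endpoint implements tools/list as usual and reuses the host/port from the examples later in this document:

# Enumerate the consolidated tool set (tools/list is the standard MCP
# discovery method; output piped through json.tool for readability)
curl -k -X POST https://localhost:6071/mcp/query \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/list"}' | python3 -m json.tool
# Expect 37 entries: 20 original query tools + 17 discovery tools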

Phase 1: Static Discovery (C++)

The Static_Harvester class performs deterministic metadata extraction:

MySQL INFORMATION_SCHEMA → Static_Harvester → Discovery_Schema SQLite

Harvest stages:

  1. Schemas (information_schema.SCHEMATA)
  2. Objects (information_schema.TABLES, ROUTINES)
  3. Columns (information_schema.COLUMNS) with derived hints (is_time, is_id_like)
  4. Indexes (information_schema.STATISTICS)
  5. Foreign Keys (KEY_COLUMN_USAGE, REFERENTIAL_CONSTRAINTS)
  6. View definitions (information_schema.VIEWS)
  7. Quick profiles (metadata-based analysis)
  8. FTS5 index rebuild

Derived field calculations:

| Field | Calculation |
|-------|-------------|
| is_time | `data_type IN ('date','datetime','timestamp','time','year')` |
| is_id_like | `column_name REGEXP '(^id$ …` |
| has_primary_key | `EXISTS (SELECT 1 FROM indexes WHERE is_primary=1)` |
| has_foreign_keys | `EXISTS (SELECT 1 FROM foreign_keys WHERE child_object_id=?)` |
| has_time_column | `EXISTS (SELECT 1 FROM columns WHERE is_time=1)` |
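For intuition, the is_time rule can be previewed directly against MySQL before any harvest runs. This query is illustrative only (host, credentials, and schema are placeholders); the harvester applies the equivalent logic in C++ during stage 3:

# Preview the is_time classification straight from INFORMATION_SCHEMA
# (placeholders: adjust host, user, and TABLE_SCHEMA for your setup)
mysql -h 127.0.0.1 -u root -p -e "
  SELECT TABLE_NAME, COLUMN_NAME,
         DATA_TYPE IN ('date','datetime','timestamp','time','year') AS is_time
  FROM information_schema.COLUMNS
  WHERE TABLE_SCHEMA = 'sales';"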

Phase 2: LLM Agent Discovery (MCP Tools)

The LLM agent (via Claude Code) performs semantic analysis using 18 MCP tools:

Discovery Trigger (1 tool):

  • discovery.run_static - Triggers ProxySQL's static harvest

Catalog Tools (5 tools):

  • catalog.init - Initialize/migrate SQLite schema
  • catalog.search - FTS5 search over objects
  • catalog.get_object - Get object with columns/indexes/FKs
  • catalog.list_objects - List objects (paged)
  • catalog.get_relationships - Get FKs, view deps, inferred relationships
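All catalog tools are called through the same tools/call envelope as the examples in the Usage section. The following sketch of a catalog.search call is illustrative: only the tool name comes from the implementation, the argument names ("query", "limit") are assumptions:

# Hypothetical catalog.search call; "query" and "limit" are assumed
# argument names, only the tool name is fixed by the implementation
curl -k -X POST https://localhost:6071/mcp/query \
  -H "Content-Type: application/json" \
  -d '{
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
      "name": "catalog.search",
      "arguments": { "query": "invoice", "limit": 10 }
    }
  }'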

Agent Tools (3 tools):

  • agent.run_start - Create agent run bound to run_id
  • agent.run_finish - Mark agent run success/failed
  • agent.event_append - Log tool calls, results, decisions

LLM Memory Tools (9 tools):

  • llm.summary_upsert - Store semantic summary for object
  • llm.summary_get - Get semantic summary
  • llm.relationship_upsert - Store inferred relationship
  • llm.domain_upsert - Create/update domain
  • llm.domain_set_members - Set domain members
  • llm.metric_upsert - Store metric definition
  • llm.question_template_add - Add question template
  • llm.note_add - Add durable note
  • llm.search - FTS over LLM artifacts
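A typical Stage 2 write (see the Discovery Workflow below) persists a per-object summary via llm.summary_upsert. The argument names in this sketch are illustrative assumptions; only the tool name is taken from the implementation:

# Hypothetical llm.summary_upsert call; argument names are assumptions
curl -k -X POST https://localhost:6071/mcp/query \
  -H "Content-Type: application/json" \
  -d '{
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
      "name": "llm.summary_upsert",
      "arguments": {
        "object_id": 42,
        "summary": "Invoice line items; grain = one row per invoice line.",
        "confidence": 0.8
      }
    }
  }'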

Database Schema

Deterministic Layer Tables

| Table | Purpose |
|-------|---------|
| runs | Tracks each discovery run (run_id, started_at, finished_at, source_dsn, mysql_version) |
| schemas | Discovered MySQL schemas (schema_name, charset, collation) |
| objects | Tables/views/routines/triggers with metadata (engine, rows_est, has_pk, has_fks, has_time) |
| columns | Column details (data_type, is_nullable, is_pk, is_unique, is_indexed, is_time, is_id_like) |
| indexes | Index metadata (is_unique, is_primary, index_type, cardinality) |
| index_columns | Ordered index columns |
| foreign_keys | FK relationships |
| foreign_key_columns | Ordered FK columns |
| profiles | Profiling results (JSON for extensibility) |
| fts_objects | FTS5 index over objects (contentless) |
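Because fts_objects is a contentless FTS5 table, matches come back as rowids that join back to the objects table. An illustrative inspection (the SQLite file path is deployment-specific and shown here as a placeholder):

# Placeholder path; substitute the catalog SQLite file for your deployment
sqlite3 /path/to/discovery_catalog.db \
  "SELECT rowid FROM fts_objects WHERE fts_objects MATCH 'invoice';"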

LLM Agent Layer Tables

| Table | Purpose |
|-------|---------|
| agent_runs | LLM agent runs (bound to a deterministic run_id) |
| agent_events | Tool calls, results, decisions (traceability) |
| llm_object_summaries | Per-object semantic summaries (hypothesis, grain, dims/measures, joins) |
| llm_relationships | LLM-inferred relationships with confidence scores |
| llm_domains | Domain clusters (billing, sales, auth, etc.) |
| llm_domain_members | Object-to-domain mapping with roles |
| llm_metrics | Metric/KPI definitions |
| llm_question_templates | NL → structured query plan mappings |
| llm_notes | Free-form durable notes |
| fts_llm | FTS5 index over LLM artifacts |

Usage

The two phases of discovery are invoked separately:

Phase 1: Static Harvest (Direct curl)

Phase 1 is a simple HTTP POST to trigger deterministic metadata extraction. No Claude Code required.

# Option A: Using the convenience script (recommended)
cd scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/
./static_harvest.sh --schema sales --notes "Production sales database discovery"

# Option B: Using curl directly
curl -k -X POST https://localhost:6071/mcp/query \
  -H "Content-Type: application/json" \
  -d '{
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
      "name": "discovery.run_static",
      "arguments": {
        "schema_filter": "sales",
        "notes": "Production sales database discovery"
      }
    }
  }'
# Returns: { run_id: 1, started_at: "...", objects_count: 45, columns_count: 380 }

Phase 2: LLM Agent Discovery (via two_phase_discovery.py)

Phase 2 uses Claude Code for semantic analysis. Requires MCP configuration.

# Step 1: Copy example MCP config and customize
cp scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/mcp_config.example.json mcp_config.json
# Edit mcp_config.json to set your PROXYSQL_MCP_ENDPOINT if needed

# Step 2: Run the two-phase discovery
./scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/two_phase_discovery.py \
    --mcp-config mcp_config.json \
    --schema sales \
    --model claude-3.5-sonnet

# Dry-run mode (preview without executing)
./scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/two_phase_discovery.py \
    --mcp-config mcp_config.json \
    --schema test \
    --dry-run

Direct MCP Tool Calls (via /mcp/query endpoint)

You can also call discovery tools directly via the MCP endpoint:

# Phase 1 (discovery.run_static): identical to the curl example shown above

# Phase 2: LLM agent discovery
curl -k -X POST https://localhost:6071/mcp/query \
  -H "Content-Type: application/json" \
  -d '{
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
      "name": "agent.run_start",
      "arguments": {
        "run_id": 1,
        "model_name": "claude-3.5-sonnet"
      }
    }
  }'
# Returns: { agent_run_id: 1 }

Discovery Workflow

Stage 0: Start and plan
├─> discovery.run_static() → run_id
├─> agent.run_start(run_id) → agent_run_id
└─> agent.event_append(plan, budgets)

Stage 1: Triage and prioritization
└─> catalog.list_objects() + catalog.search() → build prioritized backlog

Stage 2: Per-object semantic summarization
└─> catalog.get_object() + catalog.get_relationships()
    └─> llm.summary_upsert() (50+ high-value objects)

Stage 3: Relationship enhancement
└─> llm.relationship_upsert() (where FKs missing or unclear)

Stage 4: Domain clustering and synthesis
└─> llm.domain_upsert() + llm.domain_set_members()
    └─> llm.note_add(domain descriptions)

Stage 5: "Answerability" artifacts
├─> llm.metric_upsert() (10-30 metrics)
└─> llm.question_template_add() (15-50 question templates)

Shutdown:
├─> agent.event_append(final_summary)
└─> agent.run_finish(success)

Quality Rules

Confidence scores:

  • 0.9–1.0: supported by schema + constraints or very strong evidence
  • 0.6–0.8: likely, supported by multiple signals but not guaranteed
  • 0.3–0.5: tentative hypothesis; flag warnings and what is needed to confirm
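Applied to Stage 3, a mid-band (0.6–0.8) inference might be recorded as follows. Only the llm.relationship_upsert tool name is fixed; the argument names and values are illustrative assumptions:

# Hypothetical llm.relationship_upsert call recording a "likely" inference;
# argument names are assumptions, only the tool name is fixed
curl -k -X POST https://localhost:6071/mcp/query \
  -H "Content-Type: application/json" \
  -d '{
    "jsonrpc": "2.0",
    "id": 3,
    "method": "tools/call",
    "params": {
      "name": "llm.relationship_upsert",
      "arguments": {
        "child": "sales.order_items.order_id",
        "parent": "sales.orders.id",
        "confidence": 0.7,
        "evidence": "column name match + identical data types; no declared FK"
      }
    }
  }'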

Critical Constraint: NO FILES

  • LLM agent MUST NOT create/read/modify any local files
  • All outputs MUST be persisted exclusively via MCP tools
  • Use agent_events and llm_notes as scratchpad

Verification

To verify the implementation:

# Build ProxySQL
cd /home/rene/proxysql-vec
make -j$(nproc)

# Verify new discovery components exist
ls -la include/Discovery_Schema.h include/Static_Harvester.h
ls -la lib/Discovery_Schema.cpp lib/Static_Harvester.cpp

# Verify Discovery_Tool_Handler was removed (should return nothing)
ls include/Discovery_Tool_Handler.h 2>&1 # Should fail
ls lib/Discovery_Tool_Handler.cpp 2>&1   # Should fail

# Verify Query_Tool_Handler uses Discovery_Schema
grep -n "Discovery_Schema" include/Query_Tool_Handler.h
grep -n "Static_Harvester" include/Query_Tool_Handler.h

# Verify Query_Tool_Handler has discovery tools
grep -n "discovery.run_static" lib/Query_Tool_Handler.cpp
grep -n "agent.run_start" lib/Query_Tool_Handler.cpp
grep -n "llm.summary_upsert" lib/Query_Tool_Handler.cpp

# Test Phase 1 (curl)
curl -k -X POST https://localhost:6071/mcp/query \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"discovery.run_static","arguments":{"schema_filter":"test"}}}'
# Should return: { run_id: 1, objects_count: X, columns_count: Y }

# Test Phase 2 (two_phase_discovery.py)
cd scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/
cp mcp_config.example.json mcp_config.json
./two_phase_discovery.py --dry-run --mcp-config mcp_config.json --schema test

Next Steps

  1. Build and test: Compile ProxySQL and test with a small database
  2. Integration testing: Test with medium database (100+ tables)
  3. Documentation updates: Update main README and MCP docs
  4. Migration guide: Document transition from legacy 6-agent to new two-phase system

References

  • Python PoC: /tmp/mysql_autodiscovery_poc.py
  • Schema specification: /tmp/schema.sql
  • MCP tools specification: /tmp/mcp_tools_discovery_catalog.json
  • System prompt reference: /tmp/system_prompt.md
  • User prompt reference: /tmp/user_prompt.md