Headless Database Discovery with Claude Code

Database discovery systems for comprehensive analysis through MCP (Model Context Protocol).

This directory contains two separate discovery approaches:

| Approach | Description | When to Use |
|---|---|---|
| Two-Phase Discovery | Static harvest + LLM semantic analysis (NEW) | Quick, efficient discovery with semantic insights |
| Multi-Agent Discovery | 6-agent collaborative analysis | Deep, comprehensive analysis (legacy) |

Overview

The two-phase discovery provides fast, efficient database schema discovery:

Phase 1: Static Harvest (C++)

  • Deterministic metadata extraction from INFORMATION_SCHEMA
  • Simple curl command - no Claude Code required
  • Returns: run_id, objects_count, columns_count, indexes_count, etc.

Phase 2: LLM Agent Discovery (Optional)

  • Semantic analysis using Claude Code
  • Generates summaries, domains, metrics, and question templates
  • Requires MCP configuration

Quick Start

cd scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/

# Phase 1: Static harvest (no Claude Code needed)

# Option A: Using the convenience script (recommended)
./static_harvest.sh --target-id tap_mysql_default --schema test

# Option B: Using curl directly
curl -k -X POST https://localhost:6071/mcp/query \
  -H "Content-Type: application/json" \
  -d '{
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
      "name": "discovery.run_static",
      "arguments": {
        "target_id": "tap_mysql_default",
        "schema_filter": "test"
      }
    }
  }'
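
The same JSON-RPC call can be issued from Python. A minimal stdlib-only sketch, assuming the endpoint, tool name, and arguments shown in the curl example above (the response fields follow the Phase 1 description; `run_static_harvest` is a hypothetical helper name):

```python
import json
import ssl
import urllib.request

def build_static_harvest_request(target_id: str, schema_filter: str) -> dict:
    """Build the JSON-RPC 2.0 payload for the discovery.run_static tool."""
    return {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "tools/call",
        "params": {
            "name": "discovery.run_static",
            "arguments": {"target_id": target_id, "schema_filter": schema_filter},
        },
    }

def run_static_harvest(endpoint: str, target_id: str, schema_filter: str) -> dict:
    """POST the payload; equivalent to the curl command above."""
    payload = json.dumps(build_static_harvest_request(target_id, schema_filter)).encode()
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE  # same as curl -k (self-signed certificate)
    req = urllib.request.Request(endpoint, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, context=ctx) as resp:
        return json.loads(resp.read())

# Example (requires a running ProxySQL MCP endpoint):
# result = run_static_harvest("https://localhost:6071/mcp/query",
#                             "tap_mysql_default", "test")
# The result carries run_id, objects_count, columns_count, indexes_count, etc.
```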

# Phase 2: LLM agent discovery (requires Claude Code)
cp mcp_config.example.json mcp_config.json
./two_phase_discovery.py \
    --mcp-config mcp_config.json \
    --target-id tap_mysql_default \
    --schema test \
    --dry-run  # Preview without executing

Files

| File | Purpose |
|---|---|
| two_phase_discovery.py | Orchestration script for Phase 2 |
| mcp_config.example.json | Example MCP configuration for Claude Code |
| prompts/two_phase_discovery_prompt.md | System prompt for the LLM agent |
| prompts/two_phase_user_prompt.md | User prompt template |

Documentation

See Two_Phase_Discovery_Implementation.md for complete implementation details.


Multi-Agent Discovery (Legacy)

Multi-agent database discovery system for comprehensive analysis through MCP (Model Context Protocol).

Overview

This directory contains scripts for running 6-agent collaborative database discovery in headless (non-interactive) mode using Claude Code.

Key Features:

  • 6 Agents (5 Analysis + 1 Meta): STRUCTURAL, STATISTICAL, SEMANTIC, QUERY, SECURITY, META
  • 5-Round Protocol: Blind exploration → Pattern recognition → Hypothesis testing → Final synthesis → Meta analysis
  • MCP Catalog Collaboration: Agents share findings via catalog
  • Comprehensive Reports: Structured markdown with health scores and prioritized recommendations
  • Evidence-Based: 20+ hypothesis validations with direct database evidence
  • Self-Improving: META agent analyzes report quality and suggests prompt improvements

Quick Start

# Basic discovery - discovers the first available database
python ./headless_db_discovery.py

# Discover a specific database
python ./headless_db_discovery.py --database mydb

# Specify output file
python ./headless_db_discovery.py --output my_report.md

# With verbose output
python ./headless_db_discovery.py --verbose

Using the Bash Script

# Basic discovery
./headless_db_discovery.sh

# Discover specific database
./headless_db_discovery.sh -d mydb

# With custom timeout
./headless_db_discovery.sh -t 600

Multi-Agent Discovery Architecture

The 6 Agents

| Agent | Type | Focus | Key MCP Tools |
|---|---|---|---|
| STRUCTURAL | Analysis | Schemas, tables, relationships, indexes, constraints | list_schemas, list_tables, describe_table, get_constraints, suggest_joins |
| STATISTICAL | Analysis | Data distributions, quality, anomalies | table_profile, sample_rows, column_profile, sample_distinct, run_sql_readonly |
| SEMANTIC | Analysis | Business domain, entities, rules, terminology | sample_rows, sample_distinct, run_sql_readonly |
| QUERY | Analysis | Index efficiency, query patterns, optimization | describe_table, explain_sql, suggest_joins, run_sql_readonly |
| SECURITY | Analysis | Sensitive data, access patterns, vulnerabilities | sample_rows, sample_distinct, column_profile, run_sql_readonly |
| META | Meta | Report quality analysis, prompt improvement suggestions | catalog_search, catalog_get (reads findings) |

5-Round Protocol

  1. Round 1: Blind Exploration (Parallel)

    • All 5 analysis agents explore independently
    • Each discovers patterns without seeing others' findings
    • Findings written to MCP catalog
  2. Round 2: Pattern Recognition (Collaborative)

    • All 5 analysis agents read each other's findings via catalog_search
    • Identify cross-cutting patterns and anomalies
    • Collaborative analysis documented
  3. Round 3: Hypothesis Testing (Validation)

    • Each analysis agent validates 3-4 specific hypotheses
    • Results documented with PASS/FAIL/MIXED and evidence
    • 20+ hypothesis validations total
  4. Round 4: Final Synthesis

    • All 5 analysis agents synthesize findings into comprehensive report
    • Written to MCP catalog and local file
  5. Round 5: Meta Analysis (META agent only)

    • META agent reads the complete final report
    • Analyzes each section for depth, completeness, quality
    • Identifies gaps and suggests prompt improvements
    • Writes separate meta-analysis document to MCP catalog
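
The round sequencing above can be sketched as a simple driver loop. This is an illustrative skeleton only (the `run_agent` callback is a hypothetical name; the real orchestration lives in headless_db_discovery.py):

```python
ANALYSIS_AGENTS = ["STRUCTURAL", "STATISTICAL", "SEMANTIC", "QUERY", "SECURITY"]

# (round name, participating agents, mode) per the 5-round protocol
ROUNDS = [
    ("Blind Exploration",   ANALYSIS_AGENTS, "parallel"),
    ("Pattern Recognition", ANALYSIS_AGENTS, "collaborative"),
    ("Hypothesis Testing",  ANALYSIS_AGENTS, "validation"),
    ("Final Synthesis",     ANALYSIS_AGENTS, "synthesis"),
    ("Meta Analysis",       ["META"],        "meta"),
]

def run_protocol(run_agent):
    """run_agent(agent, round_name, mode) executes one agent for one round
    and is expected to write its findings to the MCP catalog."""
    findings = []
    for name, agents, mode in ROUNDS:
        for agent in agents:
            findings.append(run_agent(agent, name, mode))
    return findings

# A stub callback shows the call pattern without invoking Claude Code:
# run_protocol(lambda agent, rnd, mode: f"{agent}:{rnd}")
```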

What Gets Discovered

1. Structural Analysis

  • Complete table schemas (columns, types, constraints)
  • Primary keys, foreign keys, unique constraints
  • Indexes and their purposes
  • Entity Relationship Diagram (ERD)
  • Design patterns and anti-patterns

2. Statistical Analysis

  • Row counts and cardinality
  • Data distributions for key columns
  • Null value percentages
  • Distinct value counts and selectivity
  • Statistical summaries (min/max/avg)
  • Anomaly detection (duplicates, outliers, skew)
  • Statistical Significance Testing:
    • Normality tests (Shapiro-Wilk, Anderson-Darling)
    • Correlation analysis (Pearson, Spearman) with confidence intervals
    • Chi-square tests for categorical associations
    • Outlier detection with statistical tests
    • Group comparisons (t-test, Mann-Whitney U)
    • All tests report p-values and effect sizes
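
In the actual runs these tests are issued by the STATISTICAL agent via run_sql_readonly and reported with p-values. As a stdlib-only sketch of two of the listed measures (Pearson correlation as an effect size, and z-score outlier flagging; proper significance tests would use a statistics library):

```python
import statistics
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient (effect size for linear association)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mu = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mu) / sd > threshold]
```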

3. Semantic Analysis

  • Business domain identification (e.g., e-commerce, healthcare)
  • Entity type classification (master vs transactional)
  • Business rules and constraints
  • Entity lifecycles and state machines
  • Domain terminology glossary

4. Query Analysis

  • Index coverage and efficiency
  • Missing index identification
  • Composite index opportunities
  • Join performance analysis
  • Query pattern identification
  • Optimization recommendations with expected improvements
  • Performance Baseline Measurement:
    • Actual query execution times (not just EXPLAIN)
    • Primary key lookups with timing
    • Table scan performance
    • Index range scan efficiency
    • JOIN query benchmarks
    • Aggregation query performance
    • Efficiency scoring (EXPLAIN vs actual time comparison)
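
The "actual time vs. EXPLAIN" comparison can be illustrated with a small timing harness. This sketch uses an in-memory SQLite table purely as a stand-in (the agents run against the target database through run_sql_readonly and explain_sql, and MySQL's EXPLAIN syntax differs from SQLite's):

```python
import sqlite3
import time

def time_query(conn, sql, runs=5):
    """Measure actual execution time (median of several runs), not just the plan."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        conn.execute(sql).fetchall()
        timings.append(time.perf_counter() - start)
    timings.sort()
    return timings[len(timings) // 2]

# Stand-in benchmark table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(i, str(i)) for i in range(10_000)])

pk_lookup = time_query(conn, "SELECT v FROM t WHERE id = 4242")              # indexed path
full_scan = time_query(conn, "SELECT COUNT(*) FROM t WHERE v LIKE '%42%'")   # table scan
plan = conn.execute("EXPLAIN QUERY PLAN SELECT v FROM t WHERE id = 4242").fetchall()
```

Efficiency scoring then compares what the plan predicts (index lookup vs. scan) against the measured medians.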

5. Security Analysis

  • Sensitive Data Identification:
    • PII: names, emails, phone numbers, SSN, addresses
    • Credentials: passwords, API keys, tokens
    • Financial data: credit cards, bank accounts
    • Health data: medical records
  • Access Pattern Analysis:
    • Overly permissive schemas
    • Missing row-level security
  • Vulnerability Assessment:
    • SQL injection vectors
    • Weak authentication patterns
    • Missing encryption indicators
  • Compliance Assessment:
    • GDPR indicators (personal data)
    • PCI-DSS indicators (payment data)
    • Data retention patterns
  • Data Classification:
    • PUBLIC, INTERNAL, CONFIDENTIAL, RESTRICTED

6. Meta Analysis

  • Report quality assessment by section (depth, completeness)
  • Gap identification (what was missed)
  • Prompt improvement suggestions for future runs
  • Evolution history tracking

7. Question Catalogs

  • 90+ Answerable Questions across all agents (minimum 15-20 per agent)
  • Executable Answer Plans for each question using MCP tools
  • Question Templates with structured answer formats
  • 15+ Cross-Domain Questions requiring multiple agents (enhanced in v1.3)
  • Complexity Ratings (LOW/MEDIUM/HIGH) with time estimates

Each agent generates a catalog of questions it can answer about the database, with step-by-step plans for answering each question using MCP tools. This creates a reusable knowledge base for future LLM interactions.
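
A single catalog entry can be represented as structured data. The field names below are illustrative (they mirror the catalog structure described later in this README, not a fixed schema):

```python
# One entry from a hypothetical STRUCTURAL question catalog
question_entry = {
    "agent": "STRUCTURAL",
    "category": "Schema Inventory",
    "template": "What tables exist in the database?",
    "question_type": "factual",      # factual|analytical|comparative|predictive|recommendation
    "complexity": "LOW",             # LOW | MEDIUM | HIGH
    "estimated_time_s": 10,
    "answer_plan": [
        {"step": 1, "tool": "list_tables", "purpose": "enumerate all tables"},
        {"step": 2, "tool": "describe_table", "purpose": "fetch column details"},
    ],
    "answer_template": "Based on the schema analysis:\n- {table}: {columns}",
}
```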

Cross-Domain Categories (v1.3):

  • Performance + Security (4 questions)
  • Structure + Semantics (3 questions)
  • Statistics + Query (3 questions)
  • Security + Semantics (3 questions)
  • All Agents (2 questions)

Output Format

The generated report includes:

# COMPREHENSIVE DATABASE DISCOVERY REPORT

## Executive Summary
- Database identity (system type, purpose, scale)
- Critical findings (top 5 - one from each agent)
- Health score: current X/10 → potential Y/10
- Top 5 recommendations (prioritized)

## 1. STRUCTURAL ANALYSIS
- Schema inventory
- Relationship diagram
- Design patterns
- Issues & recommendations

## 2. STATISTICAL ANALYSIS
- Table profiles
- Data quality score
- Distribution profiles
- Anomalies detected

## 3. SEMANTIC ANALYSIS
- Business domain identification
- Entity catalog
- Business rules inference
- Domain glossary

## 4. QUERY ANALYSIS
- Index coverage assessment
- Query pattern analysis
- Optimization opportunities
- Expected improvements

## 5. SECURITY ANALYSIS
- Sensitive data identification
- Access pattern analysis
- Vulnerability assessment
- Compliance indicators
- Security recommendations

## 6. CRITICAL FINDINGS
- Each with: description, impact quantification, root cause, remediation

## 7. RECOMMENDATIONS ROADMAP
- URGENT: [actions with impact/effort]
- HIGH: [actions]
- MODERATE: [actions]
- Expected timeline with metrics

## Appendices
- A. Table DDL
- B. Query examples with EXPLAIN
- C. Statistical distributions
- D. Business glossary
- E. Security data classification

Additionally, a separate META ANALYSIS document is generated with:

  • Section quality ratings (depth, completeness)
  • Specific prompt improvement suggestions
  • Gap identification
  • Evolution history

Question Catalogs

In addition to the analysis reports, each agent generates a Question Catalog - a knowledge base of questions the agent can answer about the database, with executable plans for how to answer each question.

What Are Question Catalogs?

A Question Catalog contains:

  • 90+ questions across all agents (minimum 15-20 per agent)
  • Executable answer plans using specific MCP tools
  • Answer templates with structured output formats
  • Complexity ratings (LOW/MEDIUM/HIGH)
  • Time estimates for answering each question

Question Catalog Structure

# {AGENT} QUESTION CATALOG

## Metadata
- Agent: {STRUCTURAL|STATISTICAL|SEMANTIC|QUERY|SECURITY}
- Database: {database_name}
- Questions Generated: {count}

## Questions by Category

### Category 1: {Category Name}

#### Q1. {Question Template}
**Question Type:** factual|analytical|comparative|predictive|recommendation

**Example Questions:**
- "What tables exist in the database?"
- "What columns does table X have?"

**Answer Plan:**
1. Step 1: Use `list_tables` to get all tables
2. Step 2: Use `describe_table` to get column details
3. Output: Structured list with table names and column details

**Answer Template:**
Based on the schema analysis:
- Table 1: {columns}
- Table 2: {columns}

Question Catalog Examples

STRUCTURAL Agent Questions

  • "What tables exist in the database?"
  • "How are tables X and Y related?"
  • "What indexes exist on table X?"
  • "What constraints are defined on table X?"

STATISTICAL Agent Questions

  • "How many rows does table X have?"
  • "What is the distribution of values in column X?"
  • "Are there any outliers in column X?"
  • "What percentage of values are null in column X?"

SEMANTIC Agent Questions

  • "What type of system is this database for?"
  • "What does table X represent?"
  • "What business rules are enforced?"
  • "What does term X mean in this domain?"

QUERY Agent Questions

  • "Why is query X slow?"
  • "What indexes would improve query X?"
  • "How can I optimize query X?"
  • "What is the most efficient join path?"

SECURITY Agent Questions

  • "What sensitive data exists in table X?"
  • "Where is PII stored?"
  • "What security vulnerabilities exist?"
  • "Does this database comply with GDPR?"

Cross-Domain Questions (META Agent)

At least 15 questions across 5 categories:

Performance + Security (4 questions):

  • "What are the security implications of query performance issues?"
  • "Which slow queries expose the most sensitive data?"
  • "Can query optimization create security vulnerabilities?"
  • "What is the performance impact of security measures?"

Structure + Semantics (3 questions):

  • "How does the schema design support or hinder business workflows?"
  • "What business rules are enforced (or missing) in the schema constraints?"
  • "Which tables represent core business entities vs. supporting data?"

Statistics + Query (3 questions):

  • "Which data distributions are causing query performance issues?"
  • "How would data deduplication affect index efficiency?"
  • "What is the statistical significance of query performance variations?"

Security + Semantics (3 questions):

  • "What business processes involve sensitive data exposure risks?"
  • "Which business entities require enhanced security measures?"
  • "How do business rules affect data access patterns?"

All Agents (2 questions):

  • "What is the overall database health score across all dimensions?"
  • "Which business-critical workflows have the highest technical debt?"

Using Question Catalogs

Question catalogs enable:

  1. Fast Answers: Pre-validated plans skip analysis phase
  2. Consistent Quality: All answers follow proven templates
  3. Tool Reuse: Efficient MCP tool usage patterns
  4. Comprehensive Coverage: 90+ questions cover most user needs

Example workflow:

# User asks: "What sensitive data exists in the customers table?"

# System retrieves from SECURITY question catalog:
# - Question template: "What sensitive data exists in table X?"
# - Answer plan: sample_rows + column_profile on customers
# - Answer template: Structured list with sensitivity classification

# System executes plan and returns formatted answer
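
The retrieval step in this workflow can be sketched as a template match against the stored catalog. The matching logic below is a naive keyword heuristic for illustration (a real system might use catalog_search or embedding similarity):

```python
# Tiny in-memory stand-in for the MCP question catalog
CATALOG = {
    "What sensitive data exists in table {table}?": {
        "agent": "SECURITY",
        "plan": ["sample_rows", "column_profile"],
    },
    "How many rows does table {table} have?": {
        "agent": "STATISTICAL",
        "plan": ["table_profile"],
    },
}

def find_plan(question: str):
    """Match a user question against catalog templates (naive keyword match)."""
    for template, entry in CATALOG.items():
        keywords = [w for w in template.lower().split() if "{" not in w]
        if all(k.strip("?") in question.lower() for k in keywords):
            return template, entry
    return None
```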

Minimum Questions Per Agent

| Agent | Minimum Questions | High-Complexity Target |
|---|---|---|
| STRUCTURAL | 20 | 5 |
| STATISTICAL | 20 | 5 |
| SEMANTIC | 15 | 3 |
| QUERY | 20 | 5 |
| SECURITY | 15 | 5 |
| TOTAL | 90+ | 23+ |

Stored In Catalog

All question catalogs are stored in the MCP catalog for easy retrieval:

  • kind="question_catalog", key="structural_questions"
  • kind="question_catalog", key="statistical_questions"
  • kind="question_catalog", key="semantic_questions"
  • kind="question_catalog", key="query_questions"
  • kind="question_catalog", key="security_questions"
  • kind="question_catalog", key="cross_domain_questions"

Command-Line Options

| Option | Short | Description | Default |
|---|---|---|---|
| --database | -d | Database name to discover | First available |
| --schema | -s | Schema name to analyze | All schemas |
| --output | -o | Output file path | discovery_YYYYMMDD_HHMMSS.md |
| --timeout | -t | Timeout in seconds | 300 |
| --verbose | -v | Enable verbose output | Disabled |
| --help | -h | Show help message | - |
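
These options map directly onto a standard argparse definition. A sketch mirroring the table (headless_db_discovery.py remains the authoritative version):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Argument parser mirroring the documented command-line options."""
    p = argparse.ArgumentParser(description="Headless database discovery")
    p.add_argument("--database", "-d", default=None,
                   help="Database name to discover (default: first available)")
    p.add_argument("--schema", "-s", default=None,
                   help="Schema name to analyze (default: all schemas)")
    p.add_argument("--output", "-o", default=None,
                   help="Output file path (default: discovery_YYYYMMDD_HHMMSS.md)")
    p.add_argument("--timeout", "-t", type=int, default=300,
                   help="Timeout in seconds")
    p.add_argument("--verbose", "-v", action="store_true",
                   help="Enable verbose output")
    return p
```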

System Prompts

The discovery uses the system prompt in prompts/multi_agent_discovery_prompt.md:

  • prompts/multi_agent_discovery_prompt.md - Concise system prompt for actual use
  • prompts/multi_agent_discovery_reference.md - Comprehensive reference documentation

Examples

CI/CD Integration

# .github/workflows/database-discovery.yml
name: Database Discovery

on:
  schedule:
    - cron: '0 0 * * 0'  # Weekly
  workflow_dispatch:

jobs:
  discovery:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Install Claude Code
        run: npm install -g @anthropic-ai/claude-code
      - name: Run Discovery
        env:
          PROXYSQL_MCP_ENDPOINT: ${{ secrets.PROXYSQL_MCP_ENDPOINT }}
          PROXYSQL_MCP_TOKEN: ${{ secrets.PROXYSQL_MCP_TOKEN }}
        run: |
          cd scripts/mcp/DiscoveryAgent/ClaudeCode_Headless
          python ./headless_db_discovery.py \
            --database production \
            --output discovery_$(date +%Y%m%d).md          
      - name: Upload Report
        uses: actions/upload-artifact@v3
        with:
          name: discovery-report
          path: discovery_*.md

Monitoring Automation

#!/bin/bash
# weekly_discovery.sh - Run weekly and compare results

REPORT_DIR="/var/db-discovery/reports"
mkdir -p "$REPORT_DIR"

# Run discovery
python ./headless_db_discovery.py \
  --database mydb \
  --output "$REPORT_DIR/discovery_$(date +%Y%m%d).md"

# Compare with previous week
PREV=$(ls -t "$REPORT_DIR"/discovery_*.md | head -2 | tail -1)
if [ -f "$PREV" ]; then
  echo "=== Changes since last discovery ==="
  diff "$PREV" "$REPORT_DIR/discovery_$(date +%Y%m%d).md" || true
fi

Custom Discovery Focus

# Modify the prompt in the script for focused discovery
from typing import Optional

def build_discovery_prompt(database: Optional[str]) -> str:
    prompt = f"""Using the 6-agent discovery protocol, focus on:
    1. Security aspects of {database}
    2. Performance optimization opportunities
    3. Data quality issues

    Follow the standard 5-round protocol but prioritize these areas.
    """
    return prompt

Troubleshooting

"Claude Code executable not found"

Set the CLAUDE_PATH environment variable:

export CLAUDE_PATH="/path/to/claude"
python ./headless_db_discovery.py

Or install Claude Code:

npm install -g @anthropic-ai/claude-code

"No MCP servers available"

Ensure MCP servers are configured in your Claude Code settings or provide MCP configuration via command line.

Discovery times out

Increase the timeout:

python ./headless_db_discovery.py --timeout 600

Output is truncated

The multi-agent prompt is designed for comprehensive output. If truncated:

  1. Increase timeout
  2. Check MCP server connection stability
  3. Review MCP catalog for partial results

Directory Structure

ClaudeCode_Headless/
├── README.md                           # This file
├── prompts/
│   ├── multi_agent_discovery_prompt.md      # Concise system prompt
│   └── multi_agent_discovery_reference.md   # Comprehensive reference
├── headless_db_discovery.py            # Python script
├── headless_db_discovery.sh            # Bash script
└── examples/
    ├── DATABASE_DISCOVERY_REPORT.md        # Example output
    └── DATABASE_QUESTION_CAPABILITIES.md   # Feature documentation

License

Same license as the proxysql-vec project.