
# Multi-Agent Database Discovery System

## Overview

This document describes a multi-agent database discovery system built on Claude Code's autonomous subagent capabilities. Four specialized subagents run in parallel and share their findings through the MCP (Model Context Protocol) catalog; a main agent then synthesizes those findings into a single database analysis report.

## Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                     Main Agent (Orchestrator)                       │
│  - Launches 4 specialized subagents in parallel                     │
│  - Coordinates via MCP catalog                                      │
│  - Synthesizes final report                                         │
└────────────────┬────────────────────────────────────────────────────┘
                 │
    ┌────────────┼────────────┬────────────┬────────────┐
    │            │            │            │            │
    ▼            ▼            ▼            ▼            ▼
┌────────┐  ┌────────┐  ┌────────┐  ┌────────┐  ┌────────┐
│Struct. │  │Statist.│  │Semantic│  │Query   │  │  MCP   │
│ Agent  │  │ Agent  │  │ Agent  │  │ Agent  │  │Catalog │
└────────┘  └────────┘  └────────┘  └────────┘  └────────┘
     │            │            │            │            │
     └────────────┴────────────┴────────────┴────────────┘
                          │
                   ▼              ▼
              ┌─────────┐  ┌─────────────┐
              │ Database│  │   Catalog   │
              │ (testdb)│  │ (Shared Mem)│
              └─────────┘  └─────────────┘
```

## The Four Discovery Agents

### 1. Structural Agent
**Mission**: Map tables, relationships, indexes, and constraints

**Responsibilities**:
- Complete ERD documentation
- Table schema analysis (columns, types, constraints)
- Foreign key relationship mapping
- Index inventory and assessment
- Architectural pattern identification

**Catalog Entries**: `structural_discovery`

**Key Deliverables**:
- Entity Relationship Diagram
- Complete table definitions
- Index inventory with recommendations
- Relationship cardinality mapping
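
When a schema declares no foreign keys, relationship mapping has to fall back on heuristics. The sketch below shows one possible heuristic: matching `<name>_id` columns to table names. `infer_relationships` is a hypothetical helper, not part of any MCP tool, and the schema is an illustrative assumption rather than the actual `testdb` layout.

```python
def infer_relationships(schema):
    """Guess foreign keys from column names of the form <stem>_id.

    `schema` maps table name -> list of column names (assumed input shape).
    Returns (child_table, column, parent_table) tuples.
    """
    relationships = []
    for table, columns in schema.items():
        for column in columns:
            if not column.endswith("_id"):
                continue
            stem = column[:-3]  # "customer_id" -> "customer"
            # Try the stem itself, then a naive plural, as the parent table name.
            for candidate in (stem, stem + "s"):
                if candidate in schema and candidate != table:
                    relationships.append((table, column, candidate))
                    break
    return relationships


# Illustrative schema; all names are assumptions for this example only.
schema = {
    "customers": ["id", "name", "email"],
    "orders": ["id", "customer_id", "order_date"],
    "order_items": ["id", "order_id", "product_id", "quantity"],
    "products": ["id", "name", "price"],
}
relationships = infer_relationships(schema)
```

Name matching only yields candidates; an agent would still need to confirm each one against the data, for example by checking that every child value exists in the parent table.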

### 2. Statistical Agent
**Mission**: Profile data distributions, patterns, and anomalies

**Responsibilities**:
- Table row counts and cardinality analysis
- Data distribution profiling
- Anomaly detection (duplicates, outliers)
- Statistical summaries (min/max/avg/stddev)
- Business metrics calculation

**Catalog Entries**: `statistical_discovery`

**Key Deliverables**:
- Data quality score
- Duplicate detection reports
- Statistical distributions
- True vs inflated metrics
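
As a rough illustration of duplicate detection, the sketch below counts repeated business keys in a list of rows. `duplicate_report`, the column names, and the sample data are all assumptions made for this example; a real agent would more likely push the grouping into SQL through `run_sql_readonly`.

```python
from collections import Counter


def duplicate_report(rows, key_columns):
    """Summarize how many rows repeat the same business key."""
    keys = [tuple(row[c] for c in key_columns) for row in rows]
    counts = Counter(keys)
    total = len(keys)
    distinct = len(counts)
    return {
        "total_rows": total,
        "distinct_rows": distinct,
        # Share of rows that are extra copies of an already-seen key.
        "redundant_fraction": (total - distinct) / total if total else 0.0,
        "duplicated_keys": sorted(k for k, n in counts.items() if n > 1),
    }


# Illustrative data: five customers, each stored three times.
rows = [{"name": f"customer_{i}", "email": f"c{i}@example.com"} for i in range(5)] * 3
report = duplicate_report(rows, ["name", "email"])
```

Choosing `key_columns` is the judgment call here: including a surrogate primary key would hide duplicates, because every copy would then look unique.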

### 3. Semantic Agent
**Mission**: Infer business domain and entity types

**Responsibilities**:
- Business domain identification
- Entity type classification (master vs transactional)
- Business rule discovery
- Entity lifecycle analysis
- State machine identification

**Catalog Entries**: `semantic_discovery`

**Key Deliverables**:
- Complete domain model
- Business rules documentation
- Entity lifecycle definitions
- Missing capabilities identification

### 4. Query Agent
**Mission**: Analyze access patterns and optimization opportunities

**Responsibilities**:
- Query pattern identification
- Index usage analysis
- Performance bottleneck detection
- N+1 query risk assessment
- Optimization recommendations

**Catalog Entries**: `query_discovery`

**Key Deliverables**:
- Access pattern analysis
- Index recommendations (prioritized)
- Query optimization strategies
- EXPLAIN analysis results
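
To make the index usage analysis concrete, the sketch below compares a query plan before and after adding an index on `order_date`. It uses an in-memory SQLite database purely as a stand-in so the example is self-contained; the table layout is assumed, and the target database's own `EXPLAIN` output will look different. Because `run_sql_readonly` is read-only, an experiment that creates an index like this would have to run on a scratch copy.

```python
import sqlite3


def plan_for(conn, sql, params=()):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, order_date) VALUES (?, ?)",
    [(i % 5, f"2024-01-{i + 1:02d}") for i in range(15)],
)

query = "SELECT id FROM orders WHERE order_date = ?"
before = plan_for(conn, query, ("2024-01-03",))  # full table scan expected
conn.execute("CREATE INDEX idx_orders_order_date ON orders (order_date)")
after = plan_for(conn, query, ("2024-01-03",))   # index lookup expected
```

The exact plan wording varies between SQLite versions, so it is safer to check for the index name in `after` than to compare full strings.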

## Discovery Process

### Round Structure

Each agent runs 4 rounds of analysis:

#### Round 1: Blind Exploration
- Initial schema/data analysis
- First observations cataloged
- Initial hypotheses formed

#### Round 2: Pattern Recognition
- Read other agents' findings from catalog
- Identify patterns and anomalies
- Refine the Round 1 hypotheses in light of those findings

#### Round 3: Hypothesis Testing
- Validate business rules against actual data
- Cross-reference findings with other agents
- Confirm or reject hypotheses

#### Round 4: Final Synthesis
- Compile comprehensive findings
- Generate actionable recommendations
- Create final mission summary
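
The round structure can be sketched as a simple loop over agents and rounds. This is a minimal illustration under stated assumptions: agents run sequentially and in lockstep here for readability, the shared catalog is a plain dictionary, and `run_round` is a hypothetical placeholder for the actual analysis work.

```python
ROUNDS = ["blind_exploration", "pattern_recognition", "hypothesis_testing", "final_synthesis"]
AGENTS = ["structural", "statistical", "semantic", "query"]


def run_round(agent, round_no, round_name, catalog):
    """Placeholder for one agent's work in one round (hypothetical)."""
    # Round 1 is blind; from round 2 onward an agent reads its peers' entries.
    if round_no > 1:
        peers_seen = sorted({a for (a, _r) in catalog if a != agent})
    else:
        peers_seen = []
    catalog[(agent, round_no)] = {"round": round_name, "peers_seen": peers_seen}


catalog = {}
for round_no, round_name in enumerate(ROUNDS, start=1):
    for agent in AGENTS:
        run_round(agent, round_no, round_name, catalog)
```

With four agents and four rounds, the loop leaves 16 entries in the catalog, one per agent per round.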

### Catalog-Based Collaboration

```python
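# Pseudocode for MCP tool calls. On target-scoped servers, also pass
# target_id (see "Target Scoping Requirement").
# Agent writes findings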
catalog_upsert(
    kind="structural_discovery",
    key="table_customers",
    document="...",
    tags="structural,table,schema"
)

# Agent reads other agents' findings
findings = catalog_list(kind="statistical_discovery")
```
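
To show the write-then-read pattern end to end, here is a toy in-memory stand-in for the catalog. It is not the real MCP server implementation; in particular, it assumes that an entry is identified by its `(kind, key)` pair and that a second upsert with the same pair overwrites the first.

```python
class InMemoryCatalog:
    """Toy stand-in for the MCP catalog, for illustration only."""

    def __init__(self):
        self._entries = {}  # (kind, key) -> entry dict

    def upsert(self, kind, key, document, tags=""):
        # Insert a new entry, or overwrite the one with the same (kind, key).
        self._entries[(kind, key)] = {
            "kind": kind, "key": key, "document": document, "tags": tags,
        }

    def list(self, kind):
        return [entry for (k, _key), entry in self._entries.items() if k == kind]

    def get(self, kind, key):
        return self._entries.get((kind, key))


catalog = InMemoryCatalog()
catalog.upsert("structural_discovery", "table_customers", "draft notes",
               tags="structural,table,schema")
catalog.upsert("structural_discovery", "table_customers", "final notes",
               tags="structural,table,schema")
catalog.upsert("statistical_discovery", "table_customers", "15 rows, 5 distinct",
               tags="statistical")

structural = catalog.list("structural_discovery")
```

Because each agent writes under its own `kind`, two agents can use the same `key` (here `table_customers`) without overwriting each other.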

## Example Discovery Output

### Database: testdb (E-commerce Order Management)

#### True Statistics (After Deduplication)
| Metric | Current (with duplicates) | Actual (deduplicated) |
|--------|---------------------------|-----------------------|
| Customers | 15 | 5 |
| Products | 15 | 5 |
| Orders | 15 | 5 |
| Order Items | 27 | 9 |
| Revenue | $10,886.67 | $3,628.85 |

#### Critical Findings
1. **Data Quality**: 5/100 (Catastrophic). Every record is stored three times, so about 67% of rows are redundant copies
2. **Missing Index**: `orders.order_date` has no index (P0, critical)
3. **Missing Constraints**: No `UNIQUE` or foreign key constraints are defined
4. **Business Domain**: E-commerce order management system
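
The 67% figure follows directly from the statistics table above: it is the share of each current value that disappears after deduplication. A short check, with the numbers copied from that table (`redundancy_pct` is just a helper name for this example):

```python
# (current, deduplicated) pairs copied from the statistics table above.
metrics = {
    "Customers": (15, 5),
    "Products": (15, 5),
    "Orders": (15, 5),
    "Order Items": (27, 9),
    "Revenue": (10886.67, 3628.85),
}


def redundancy_pct(current, actual):
    """Percentage of the current value that is attributable to duplicates."""
    return round(100 * (current - actual) / current, 1)


redundancy = {name: redundancy_pct(cur, act) for name, (cur, act) in metrics.items()}
```

Each metric comes out at roughly two thirds, which is what threefold duplication predicts: two of every three rows are copies.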

## Launching the Discovery System

```python
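# Pseudocode for Claude Code's Task tool (not an importable Python function).
# Issue all four Task calls in a single message so they run in parallel;
# the *_AGENT_PROMPT constants hold each agent's instructions.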
Task(
    description="Structural Discovery",
    prompt=STRUCTURAL_AGENT_PROMPT,
    subagent_type="general-purpose"
)

Task(
    description="Statistical Discovery",
    prompt=STATISTICAL_AGENT_PROMPT,
    subagent_type="general-purpose"
)

Task(
    description="Semantic Discovery",
    prompt=SEMANTIC_AGENT_PROMPT,
    subagent_type="general-purpose"
)

Task(
    description="Query Discovery",
    prompt=QUERY_AGENT_PROMPT,
    subagent_type="general-purpose"
)
```

## MCP Tools Used

The agents use these MCP tools for database analysis:

- `list_schemas` - List all databases
- `list_tables` - List tables in a schema
- `describe_table` - Get table schema
- `sample_rows` - Get sample data from table
- `column_profile` - Get column statistics
- `run_sql_readonly` - Execute read-only queries
- `catalog_upsert` - Store findings in catalog
- `catalog_list` / `catalog_get` - Retrieve findings from catalog

### Target Scoping Requirement

Discovery, catalog, agent, and LLM tools are target-scoped. Always pass `target_id`:

- `discovery.run_static(target_id=..., schema_filter=...)`
- `catalog.*(target_id=..., run_id=...)`
- `agent.run_start(target_id=..., run_id=...)`
- `llm.*(target_id=..., run_id=...)`

`run_id` resolution is no longer global. The same schema name can exist on multiple targets, so `target_id` is required to resolve the correct discovery run.
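
The ambiguity is easy to see with a small lookup table. In this sketch the registry, the target names, and the run identifiers are all hypothetical; it only illustrates why a schema name alone cannot identify a run once two targets share that name.

```python
# Hypothetical registry of discovery runs, keyed by (target_id, schema name).
RUNS = {
    ("mysql-prod", "testdb"): "run-001",
    ("mysql-staging", "testdb"): "run-002",
}


def resolve_run_id(target_id, schema):
    """Target-scoped lookup: at most one run can match."""
    try:
        return RUNS[(target_id, schema)]
    except KeyError:
        raise LookupError(f"no discovery run for {schema!r} on target {target_id!r}")


def resolve_run_id_global(schema):
    """What a global lookup would face: every run with that schema name."""
    return sorted(run for (_target, s), run in RUNS.items() if s == schema)


scoped = resolve_run_id("mysql-staging", "testdb")
ambiguous = resolve_run_id_global("testdb")
```

Here `ambiguous` contains two runs for the same schema name, while the scoped lookup returns exactly one.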

## Benefits of Multi-Agent Approach

1. **Parallel Execution**: All 4 agents run simultaneously
2. **Specialized Expertise**: Each agent focuses on its domain
3. **Cross-Validation**: Agents read each other's catalog entries and check them against their own findings
4. **Comprehensive Coverage**: Structure, statistics, semantics, and query patterns are all analyzed
5. **Knowledge Synthesis**: Final report combines all perspectives

## Output Format

The system produces:

1. **40+ Catalog Entries** - Detailed findings organized by agent
2. **Comprehensive Report** - Executive summary with:
   - Structure & Schema (ERD, table definitions)
   - Business Domain (entity model, business rules)
   - Key Insights (data quality, performance)
   - Data Quality Assessment (score, recommendations)

## Future Enhancements

- [ ] Additional specialized agents (Security, Performance, Compliance)
- [ ] Automated remediation scripts
- [ ] Continuous monitoring mode
- [ ] Integration with CI/CD pipelines
- [ ] Web-based dashboard for findings

## Related Files

- `simple_discovery.py` - Simplified demo of multi-agent pattern
- `mcp_catalog.db` - Catalog database for storing findings

## References

- Claude Code Task Tool Documentation
- MCP (Model Context Protocol) Specification
- ProxySQL MCP Server Implementation