Database Discovery Agent Architecture (Conceptual Design)

Overview

This document describes a conceptual architecture for an AI-powered database discovery agent that could autonomously explore, understand, and analyze any database schema regardless of complexity or domain. The agent would use a mixture-of-experts approach where specialized LLM agents collaborate to build comprehensive understanding of database structures, data patterns, and business semantics.

Note: This is a conceptual design document. The actual ProxySQL MCP implementation uses a different approach based on the two-phase discovery architecture described in Two_Phase_Discovery_Implementation.md.

Core Principles

  1. Domain Agnostic - No assumptions about what the database contains; everything is discovered
  2. Iterative Exploration - Not a one-time schema dump; continuous learning through multiple cycles
  3. Collaborative Intelligence - Multiple experts with different perspectives work together
  4. Hypothesis-Driven - Experts form hypotheses, test them, and refine understanding
  5. Confidence-Based - Exploration continues until a confidence threshold is reached

High-Level Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                      ORCHESTRATOR AGENT                             │
│  - Manages exploration state                                        │
│  - Coordinates expert agents                                        │
│  - Synthesizes findings                                             │
│  - Decides when exploration is complete                             │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                 ┌─────────────────┴───────────┬─────────────────────────────┐
                 │                             │                             │
                 ▼                             ▼                             ▼
    ┌─────────────────────────┐   ┌─────────────────────────┐   ┌─────────────────────────┐
    │   STRUCTURAL EXPERT     │   │   STATISTICAL EXPERT    │   │    SEMANTIC EXPERT      │
    │                         │   │                         │   │                         │
    │ - Schemas & tables      │   │ - Data distributions    │   │ - Business meaning      │
    │ - Relationships         │   │ - Patterns & trends     │   │ - Domain concepts       │
    │ - Constraints           │   │ - Outliers & anomalies  │   │ - Entity types          │
    │ - Indexes & keys        │   │ - Correlations          │   │ - User intent           │
    └─────────────────────────┘   └─────────────────────────┘   └─────────────────────────┘
                 │                             │                             │
                 └─────────────────────────────┼─────────────────────────────┘
                                               │
                                               ▼
                              ┌─────────────────────────────────┐
                              │      SHARED CATALOG             │
                              │      (SQLite + MCP)             │
                              │                                 │
                              │  Expert discoveries             │
                              │  Cross-expert notes             │
                              │  Exploration state              │
                              │  Hypotheses & results           │
                              └─────────────────────────────────┘
                                               │
                                               ▼
                              ┌─────────────────────────────────┐
                              │     MCP Query Endpoint          │
                              │  - Database access              │
                              │  - Catalog operations           │
                              │  - All tools available          │
                              └─────────────────────────────────┘

Expert Specializations

1. Structural Expert

Focus: Database topology and relationships

Responsibilities:

  • Map all schemas, tables, and their relationships
  • Identify primary keys, foreign keys, and constraints
  • Analyze index patterns and access structures
  • Detect table hierarchies and dependencies
  • Identify structural patterns (star schema, snowflake, hierarchical, etc.)

Exploration Strategy:

import itertools

class StructuralExpert:
    def explore(self, catalog):
        # Iteration 1: Map the territory
        tables = self.list_all_tables()
        for table in tables:
            schema = self.get_table_schema(table)
            relationships = self.find_relationships(table)

            catalog.save("structure", f"table.{table}", {
                "columns": schema["columns"],
                "primary_key": schema["pk"],
                "foreign_keys": relationships,
                "indexes": schema["indexes"]
            })

        # Iteration 2: Probe every table pair for plausible join points
        for table_a, table_b in itertools.combinations(tables, 2):
            joins = self.suggest_joins(table_a, table_b)
            if joins:
                catalog.save("relationship", f"{table_a}↔{table_b}", joins)

        # Iteration 3: Identify structural patterns, e.g. "this looks like
        # a star schema" or "hierarchical structure"
        patterns = self.identify_patterns(catalog)
        catalog.save("structure", "patterns", patterns)

Output Examples:

  • "Found 47 tables across 3 schemas"
  • "customers table has 1:many relationship with orders via customer_id"
  • "Detected star schema: fact_orders with dims: customers, products, time"
  • "Table hierarchy: categories → subcategories → products"

2. Statistical Expert

Focus: Data characteristics and patterns

Responsibilities:

  • Profile data distributions for all columns
  • Identify correlations between fields
  • Detect outliers and anomalies
  • Find temporal patterns and trends
  • Calculate data quality metrics

Exploration Strategy:

class StatisticalExpert:
    def explore(self, catalog):
        # Read structural discoveries first
        for entry in catalog.get_kind("structure"):
            table = entry["key"].removeprefix("table.")

            # Profile each column
            for col in entry["document"]["columns"]:
                stats = self.get_column_stats(table, col)
                is_numeric = stats.get("min_max") is not None

                catalog.save("statistics", f"{table}.{col}", {
                    "distinct_count": stats["distinct"],
                    "null_percentage": stats["null_pct"],
                    "distribution": stats["histogram"],
                    "top_values": stats["top_20"],
                    "numeric_range": stats["min_max"] if is_numeric else None,
                    "anomalies": stats["outliers"]
                })

        # Find correlations across the profiled columns
        correlations = self.find_correlations(catalog.get_kind("statistics"))
        catalog.save("patterns", "correlations", correlations)

Output Examples:

  • "orders.status has 4 values: pending (23%), confirmed (45%), shipped (28%), cancelled (4%)"
  • "Strong correlation (0.87) between order_items.quantity and order_total"
  • "Outlier detected: customer_age has values > 150 (likely data error)"
  • "Temporal pattern: 80% of orders placed M-F, 9am-5pm"

3. Semantic Expert

Focus: Business meaning and domain understanding

Responsibilities:

  • Infer business domain from data patterns
  • Identify entity types and their roles
  • Interpret relationships in business terms
  • Understand user intent and use cases
  • Document business rules and constraints

Exploration Strategy:

class SemanticExpert:
    def explore(self, catalog):
        # Synthesize findings from other experts
        structure = catalog.get_kind("structure")
        stats = catalog.get_kind("statistics")

        for entry in structure:
            table = entry["key"].removeprefix("table.")

            # Infer domain from table name, columns, and data
            domain = self.infer_domain(entry, stats)
            # "This is an ecommerce database"

            # Understand entities
            entity_type = self.identify_entity(entry)
            # "customers table = Customer entities"

            # Understand relationships
            for rel in catalog.get_relationships(table):
                other = rel["other_table"]  # table on the far side of the join
                business_rel = self.interpret_relationship(rel)
                # "customer has many orders"
                catalog.save("semantic", f"rel.{table}.{other}", {
                    "relationship": business_rel,
                    "cardinality": "one-to-many",
                    "business_rule": "A customer can place multiple orders"
                })

        # Identify business processes
        processes = self.infer_processes(structure, stats)
        # "Order fulfillment flow: orders → order_items → products"
        catalog.save("semantic", "processes", processes)

Output Examples:

  • "Domain inference: E-commerce platform (B2C)"
  • "Entity: customers represents individual shoppers, not businesses"
  • "Business process: Order lifecycle = pending → confirmed → shipped → delivered"
  • "Business rule: Customer cannot be deleted if they have active orders"

4. Query Expert

Focus: Efficient data access patterns

Responsibilities:

  • Analyze query optimization opportunities
  • Recommend index usage strategies
  • Determine optimal join orders
  • Design sampling strategies for exploration
  • Identify performance bottlenecks

Exploration Strategy:

class QueryExpert:
    def explore(self, catalog):
        # Build on the structural expert's discoveries
        for entry in catalog.get_kind("structure"):
            table = entry["key"].removeprefix("table.")

            # Suggest optimal access patterns
            access_patterns = self.analyze_access_patterns(entry)
            catalog.save("query", f"access.{table}", {
                "best_index": access_patterns["optimal_index"],
                "join_order": access_patterns["optimal_join_order"],
                "sampling_strategy": access_patterns["sample_method"]
            })
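
One way analyze_access_patterns() might probe index choices is to ask the optimizer directly via EXPLAIN; the mcp.run_query() helper and the representative predicate are assumptions of this sketch:

def probe_index_choice(mcp, table, where_clause):
    """Hypothetical sketch: let MySQL's EXPLAIN reveal the chosen index."""
    plan = mcp.run_query(
        f"EXPLAIN SELECT * FROM `{table}` WHERE {where_clause}"
    )[0]
    return {
        "chosen_index": plan.get("key"),            # index the optimizer picked
        "candidate_indexes": plan.get("possible_keys"),
        "estimated_rows": plan.get("rows"),
    }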

Output Examples:

  • "For customers table, use idx_email for lookups, idx_created_at for time ranges"
  • "Join order: customers → orders → order_items (not reverse)"
  • "Sample strategy: Use TABLESAMPLE for large tables, LIMIT 1000 for small"

Orchestrator: The Conductor

The Orchestrator agent coordinates all experts and manages the overall discovery process.

class DiscoveryOrchestrator:
    """Coordinates the collaborative discovery process"""

    def __init__(self, mcp_endpoint, auth_token=None):
        self.mcp = MCPClient(mcp_endpoint, auth_token=auth_token)
        self.catalog = CatalogClient(self.mcp)

        self.experts = [
            StructuralExpert(self.catalog),
            StatisticalExpert(self.catalog),
            SemanticExpert(self.catalog),
            QueryExpert(self.catalog)
        ]

        self.state = {
            "iteration": 0,
            "phase": "initial",
            "confidence": 0.0,
            "coverage": 0.0,  # % of database explored
            "recent_insights": 0,      # new findings in the last iteration
            "diminishing_returns": 0,  # consecutive low-insight iterations
            "expert_contributions": {e.name: 0 for e in self.experts}
        }

    def discover(self, max_iterations=50, target_confidence=0.95):
        """Main discovery loop"""
        self.state["target_confidence"] = target_confidence

        while self.state["iteration"] < max_iterations:
            self.state["iteration"] += 1

            # 1. ASSESS: What's the current state?
            assessment = self.assess_progress()

            # 2. PLAN: Which expert should work on what?
            tasks = self.plan_next_tasks(assessment)
            # Example: [
            #   {"expert": "structural", "task": "explore_orders_table", "priority": 0.8},
            #   {"expert": "semantic", "task": "interpret_customer_entity", "priority": 0.7},
            #   {"expert": "statistical", "task": "analyze_price_distribution", "priority": 0.6}
            # ]

            # 3. EXECUTE: Experts work in parallel
            results = self.execute_tasks_parallel(tasks)

            # 4. SYNTHESIZE: Combine findings
            synthesis = self.synthesize_findings(results)

            # 5. COLLABORATE: Experts share insights
            self.facilitate_collaboration(synthesis)

            # 6. REFLECT: Are we done?
            self.update_state(synthesis)

            stop, reason = self.should_stop()
            if stop:
                break

        # 7. FINALIZE: Create comprehensive understanding
        return self.create_final_report()

    def plan_next_tasks(self, assessment):
        """Decide what each expert should do next"""

        prompt = f"""
        You are orchestrating database discovery. Current state:
        {assessment}

        Expert findings:
        {self.format_expert_findings()}

        Plan the next exploration tasks. Consider:
        1. Which expert can contribute most valuable insights now?
        2. What areas need more exploration?
        3. Which expert findings should be verified or extended?

        Output JSON array of tasks, each with:
        - expert: which expert should do it
        - task: what they should do
        - priority: 0-1 (higher = more important)
        - dependencies: [array of catalog keys this depends on]
        """

        return self.llm_call(prompt)

    def facilitate_collaboration(self, synthesis):
        """Experts exchange notes and build on each other's work"""

        # Find points where experts should collaborate
        collaborations = self.find_collaboration_opportunities(synthesis)

        for collab in collaborations:
            # Example: Structural found relationship, Semantic should interpret it
            prompt = f"""
            EXPERT COLLABORATION:

            {collab['expert_a']} found: {collab['finding_a']}

            {collab['expert_b']}: Please interpret this finding from your perspective.
            Consider: How does this affect your understanding? What follow-up is needed?

            Catalog context: {self.get_relevant_context(collab)}
            """

            response = self.llm_call(prompt, expert=collab['expert_b'])
            self.catalog.save("collaboration", collab['id'], response)

    def create_final_report(self):
        """Synthesize all discoveries into comprehensive understanding"""

        prompt = f"""
        Create a comprehensive database understanding report from all expert findings.

        Include:
        1. Executive Summary
        2. Database Structure Overview
        3. Business Domain Analysis
        4. Key Insights & Patterns
        5. Data Quality Assessment
        6. Usage Recommendations

        Catalog data:
        {self.catalog.export_all()}
        """

        return self.llm_call(prompt)
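
The loop above leaves execute_tasks_parallel() undefined; a minimal sketch using a thread pool, assuming each expert exposes a run_task() method (a name introduced here for illustration):

from concurrent.futures import ThreadPoolExecutor

def execute_tasks_parallel(self, tasks):
    """Hypothetical sketch: run planned tasks concurrently, highest priority first."""
    experts = {e.name: e for e in self.experts}
    ordered = sorted(tasks, key=lambda t: -t["priority"])
    with ThreadPoolExecutor(max_workers=len(self.experts)) as pool:
        futures = [
            pool.submit(experts[t["expert"]].run_task, t["task"])
            for t in ordered
        ]
        return [f.result() for f in futures]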

Discovery Phases

Phase 1: Blind Exploration (Iterations 1-10)

Characteristics:

  • All experts work independently on basic discovery
  • No domain assumptions
  • Systematic data collection
  • Build foundational knowledge

Expert Activities:

  • Structural: Map all tables, columns, relationships, constraints
  • Statistical: Profile all columns, find distributions, cardinality
  • Semantic: Identify entity types from naming patterns, infer basic domain
  • Query: Analyze access patterns, identify indexes

Output:

  • Complete table inventory
  • Column profiles for all fields
  • Basic relationship mapping
  • Initial domain hypothesis

Phase 2: Pattern Recognition (Iterations 11-30)

Characteristics:

  • Experts begin collaborating
  • Patterns emerge from data
  • Domain becomes clearer
  • Hypotheses form

Expert Activities:

  • Structural: Identifies structural patterns (star schema, hierarchies)
  • Statistical: Finds correlations, temporal patterns, outliers
  • Semantic: Interprets relationships in business terms
  • Query: Optimizes based on discovered patterns

Example Collaboration:

Structural → Catalog: "Found customers→orders relationship (customer_id)"
Semantic reads: "This indicates customers place orders (ecommerce)"
Statistical reads: "Analyzing order patterns by customer..."
Query: "Optimizing customer-centric queries using customer_id index"

Output:

  • Domain identification (e.g., "This is an ecommerce database")
  • Business entity definitions
  • Relationship interpretations
  • Pattern documentation

Phase 3: Hypothesis-Driven Exploration (Iterations 31-45)

Characteristics:

  • Experts form and test hypotheses
  • Deep dives into specific areas
  • Validation of assumptions
  • Filling knowledge gaps

Example Hypotheses:

  • "This is a SaaS metrics database" → Test for subscription patterns
  • "There are seasonal trends in orders" → Analyze temporal distributions
  • "Data quality issues in customer emails" → Validate email formats
  • "Unused indexes exist" → Check index usage statistics

Expert Activities:

  • All experts design experiments to test hypotheses
  • Catalog stores hypothesis results (confirmed/refined/refuted)
  • Collaboration to refine understanding based on evidence
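
A minimal sketch of how a tested hypothesis could be recorded, assuming a probe query plus a grading callable that returns (confirmed, confidence, evidence); names here are illustrative:

def test_hypothesis(catalog, mcp, key, hypothesis, probe_sql, grade):
    """Hypothetical sketch: gather evidence, grade it, record the outcome."""
    rows = mcp.run_query(probe_sql)               # evidence-gathering query
    confirmed, confidence, evidence = grade(rows)
    status = "confirmed" if confirmed else "refuted"
    catalog.save("hypothesis", key, {
        "hypothesis": hypothesis,
        "status": status,
        "confidence": confidence,
        "evidence": evidence,
    }, tags=f"hypothesis,{status}")

The saved record matches the hypothesis entry format shown in the Catalog Schema section.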

Output:

  • Validated business insights
  • Refined domain understanding
  • Data quality assessment
  • Performance optimization recommendations

Phase 4: Synthesis & Validation (Iterations 46-50)

Characteristics:

  • All experts collaborate to validate findings
  • Resolve contradictions
  • Fill remaining gaps
  • Create unified understanding

Expert Activities:

  • Cross-expert validation of key findings
  • Synthesis of comprehensive understanding
  • Documentation of uncertainties
  • Recommendations for further analysis

Output:

  • Final comprehensive report
  • Confidence scores for each finding
  • Remaining uncertainties
  • Actionable recommendations

Domain-Agnostic Discovery Examples

Example 1: Law Firm Database

Phase 1 (Blind Exploration):

Structural: "Found: cases, clients, attorneys, documents, time_entries, billing_rates"
Statistical: "time_entries has 1.2M rows, highly skewed distribution, 15% null values"
Semantic: "Entity types: Cases (legal matters), Clients (people/companies), Attorneys"
Query: "Best access path: case_id → time_entries (indexed)"

Phase 2 (Pattern Recognition):

Collaboration:
  Structural → Semantic: "cases have many-to-many with attorneys (case_attorneys table)"
  Semantic: "Multiple attorneys per case = legal teams"
  Statistical: "time_entries correlate with case_stage progression (r=0.72)"
  Query: "Filter by case_date_first for time range queries (30% faster)"

Domain Inference:
  Semantic: "Legal practice management system"
  Structural: "Found invoices, payments tables - confirms practice management"
  Statistical: "Billing patterns: hourly rates, contingency fees detected"

Phase 3 (Hypothesis-Driven):

Hypothesis: "Firm specializes in specific case types"
→ Statistical: "Analyze case_type distribution"
→ Found: "70% personal_injury, 20% corporate_litigation, 10% family_law"

Hypothesis: "Document workflow exists"
→ Structural: "Found document_versions, approvals, court_filings tables"
→ Semantic: "Document approval workflow for court submissions"

Hypothesis: "Attorney productivity varies by case type"
→ Statistical: "Analyze time_entries per attorney per case_type"
→ Found: "Personal injury cases require 3.2x more attorney hours"

Phase 4 (Synthesis):

Final Understanding:
"Mid-sized personal injury law firm (50-100 attorneys)
with practice management system including:
- Case management with document workflows
- Time tracking and billing (hourly + contingency)
- 70% focus on personal injury cases
- Average case duration: 18 months
- Key metrics: case duration, settlement amounts,
  attorney productivity, document approval cycle time"

Example 2: Scientific Research Database

Phase 1 (Blind Exploration):

Structural: "experiments, samples, measurements, researchers, publications, protocols"
Statistical: "High precision numeric data (10 decimal places), temporal patterns in experiments"
Semantic: "Research lab data management system"
Query: "Measurements table largest (45M rows), needs partitioning"

Phase 2 (Pattern Recognition):

Domain: "Biology/medicine research (gene_sequences, drug_compounds detected)"
Patterns: "Experiments follow protocol → samples → measurements → analysis pipeline"
Structural: "Linear workflow: protocols → experiments → samples → measurements → analysis → publications"
Statistical: "High correlation between protocol_type and measurement_outcome"

Phase 3 (Hypothesis-Driven):

Hypothesis: "Longitudinal study design"
→ Structural: "Found repeated_measurements, time_points tables"
→ Confirmed: "Same subjects measured over time"

Hypothesis: "Control groups present"
→ Statistical: "Found clustering in measurements (treatment vs control)"
→ Confirmed: "Experimental design includes control groups"

Hypothesis: "Statistical significance testing"
→ Statistical: "Found p_value distributions, confidence intervals in results"
→ Confirmed: "Clinical trial data with statistical validation"

Phase 4 (Synthesis):

Final Understanding:
"Clinical trial data management system for pharmaceutical research
- Drug compound testing with control/treatment groups
- Longitudinal design (repeated measurements over time)
- Statistical validation pipeline
- Regulatory reporting (publication tracking)
- Sample tracking from collection to analysis"

Example 3: E-commerce Database

Phase 1 (Blind Exploration):

Structural: "customers, orders, order_items, products, categories, inventory, reviews"
Statistical: "orders has 5.4M rows, steady growth trend, seasonal patterns"
Semantic: "Online retail platform"
Query: "orders table requires date-based partitioning"

Phase 2 (Pattern Recognition):

Domain: "B2C ecommerce platform"
Relationships: "customers → orders (1:N), orders → order_items (1:N), order_items → products (N:1)"
Business flow: "Browse → Add to Cart → Checkout → Payment → Fulfillment"
Statistical: "Order value distribution: Long tail, $50 median, $280 mean"

Phase 3 (Hypothesis-Driven):

Hypothesis: "Customer segments exist"
→ Statistical: "Cluster customers by order frequency, total spend, recency"
→ Found: "3 segments: Casual (70%), Regular (25%), VIP (5%)"

Hypothesis: "Product categories affect return rates"
→ Statistical: "analyze returns by category"
→ Found: "Clothing: 12% return rate, Electronics: 3% return rate"

Hypothesis: "Seasonal buying patterns"
→ Statistical: "Time series analysis of orders by month/day/week"
→ Found: "Peak: Nov-Dec (holidays), Dip: Jan, Slow: Feb-Mar"

Phase 4 (Synthesis):

Final Understanding:
"Consumer ecommerce platform with:
- 5.4M orders, steady growth, strong seasonality
- 3 customer segments (Casual/Regular/VIP) with different behaviors
- 15% overall return rate (varies by category)
- Peak season: Nov-Dec (4.3x normal volume)
- Key metrics: conversion rate, AOV, customer lifetime value, return rate"

Catalog Schema

The catalog serves as shared memory for all experts. Key entry types:

Structure Entries

{
  "kind": "structure",
  "key": "table.customers",
  "document": {
    "columns": ["customer_id", "name", "email", "created_at"],
    "primary_key": "customer_id",
    "foreign_keys": [{"column": "region_id", "references": "regions(id)"}],
    "row_count": 125000
  },
  "tags": "customers,table"
}

Statistics Entries

{
  "kind": "statistics",
  "key": "customers.created_at",
  "document": {
    "distinct_count": 118500,
    "null_percentage": 0.0,
    "min": "2020-01-15",
    "max": "2025-01-10",
    "distribution": "uniform_growth"
  },
  "tags": "customers,created_at,temporal"
}

Semantic Entries

{
  "kind": "semantic",
  "key": "entity.customers",
  "document": {
    "entity_type": "Customer",
    "definition": "Individual shoppers who place orders",
    "business_role": "Revenue generator",
    "lifecycle": "Registered → Active → Inactive → Churned"
  },
  "tags": "semantic,entity,customers"
}

Relationship Entries

{
  "kind": "relationship",
  "key": "customers↔orders",
  "document": {
    "type": "one_to_many",
    "join_key": "customer_id",
    "business_meaning": "Customers place multiple orders",
    "cardinality_estimates": {
      "min_orders_per_customer": 1,
      "max_orders_per_customer": 247,
      "avg_orders_per_customer": 4.3
    }
  },
  "tags": "relationship,customers,orders"
}

Hypothesis Entries

{
  "kind": "hypothesis",
  "key": "vip_segment_behavior",
  "document": {
    "hypothesis": "VIP customers have higher order frequency and AOV",
    "status": "confirmed",
    "confidence": 0.92,
    "evidence": [
      "VIP avg 12.4 orders/year vs 2.1 for regular",
      "VIP avg AOV $156 vs $45 for regular"
    ]
  },
  "tags": "hypothesis,customer_segments,confirmed"
}

Collaboration Entries

{
  "kind": "collaboration",
  "key": "semantic_interpretation_001",
  "document": {
    "trigger": "Structural expert found orders.status enum",
    "expert": "semantic",
    "interpretation": "Order lifecycle: pending → confirmed → shipped → delivered",
    "follow_up_tasks": ["Analyze time_in_status durations", "Find bottleneck status"]
  },
  "tags": "collaboration,structural,semantic,order_lifecycle"
}
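
A minimal sketch of the catalog API the expert pseudocode assumes, backed here by a local SQLite file for self-containment (the design routes it through MCP); the table schema and method names are illustrative, not the shipped implementation:

import json
import sqlite3

class CatalogClient:
    """Hypothetical sketch: shared catalog over a single SQLite table."""

    def __init__(self, path="catalog.db"):
        self.db = sqlite3.connect(path)
        self.db.execute("""
            CREATE TABLE IF NOT EXISTS catalog (
                kind     TEXT NOT NULL,
                key      TEXT NOT NULL,
                document TEXT NOT NULL,   -- JSON blob
                tags     TEXT DEFAULT '',
                PRIMARY KEY (kind, key)
            )
        """)

    def save(self, kind, key, document, tags=""):
        self.db.execute(
            "INSERT OR REPLACE INTO catalog VALUES (?, ?, ?, ?)",
            (kind, key, json.dumps(document), tags))
        self.db.commit()

    def get_kind(self, kind):
        rows = self.db.execute(
            "SELECT key, document FROM catalog WHERE kind = ?", (kind,))
        return [{"key": k, "document": json.loads(d)} for k, d in rows]

    def search(self, term):
        like = f"%{term}%"
        rows = self.db.execute(
            "SELECT kind, key, document FROM catalog "
            "WHERE key LIKE ? OR tags LIKE ? OR document LIKE ?",
            (like, like, like))
        return [{"kind": kd, "key": k, "document": json.loads(d)}
                for kd, k, d in rows]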

Stopping Criteria

The orchestrator evaluates whether to continue exploration based on:

  1. Confidence Threshold - Overall confidence in understanding exceeds target (e.g., 0.95)
  2. Coverage Threshold - Sufficient percentage of database explored (e.g., 95% of tables analyzed)
  3. Diminishing Returns - Last N iterations produced minimal new insights
  4. Resource Limits - Maximum iterations reached or time budget exceeded
  5. Expert Consensus - All experts indicate satisfactory understanding

def should_stop(self):
    # High confidence in core understanding
    if self.state["confidence"] >= self.state["target_confidence"]:
        return True, "Confidence threshold reached"

    # Good coverage of database
    if self.state["coverage"] >= 0.95:
        return True, "Coverage threshold reached"

    # Diminishing returns: several consecutive low-insight iterations
    if self.state["recent_insights"] < 2:
        self.state["diminishing_returns"] += 1
        if self.state["diminishing_returns"] >= 3:
            return True, "Diminishing returns"
    else:
        self.state["diminishing_returns"] = 0

    # Expert consensus
    if all(expert.satisfied() for expert in self.experts):
        return True, "Expert consensus achieved"

    return False, "Continue exploration"

Implementation Considerations

Scalability

For large databases (hundreds/thousands of tables):

  • Parallel Exploration: Experts work simultaneously on different table subsets
  • Incremental Coverage: Prioritize important tables (many relationships, high cardinality)
  • Smart Sampling: Use statistical sampling instead of full scans for large tables
  • Progressive Refinement: Start with overview, drill down iteratively
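
For instance, the smart-sampling point above might reduce to a helper like this (thresholds illustrative; MySQL has no TABLESAMPLE, so RAND() filtering stands in, and mcp.run_query() is assumed):

def sample_rows(mcp, table, row_count, n=1000):
    """Hypothetical sketch: full read for small tables, random sample for large."""
    if row_count <= 10_000:
        return mcp.run_query(f"SELECT * FROM `{table}` LIMIT {n}")
    fraction = min(1.0, (n * 10) / row_count)   # oversample, then trim with LIMIT
    return mcp.run_query(
        f"SELECT * FROM `{table}` WHERE RAND() < {fraction} LIMIT {n}"
    )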

Performance

  • Caching: Cache catalog queries to avoid repeated reads
  • Batch Operations: Group multiple tool calls when possible
  • Index-Aware: Let Query Expert guide exploration to use indexed columns
  • Connection Pooling: Reuse database connections (already implemented in MCP)
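
The caching point could be as simple as a read-through wrapper around the catalog client (a sketch, assuming the CatalogClient shape shown earlier):

from functools import lru_cache

class CachedCatalog:
    """Hypothetical sketch: memoize catalog reads, invalidate on write."""

    def __init__(self, catalog):
        self._catalog = catalog
        self.get_kind = lru_cache(maxsize=128)(catalog.get_kind)

    def save(self, kind, key, document, tags=""):
        self.get_kind.cache_clear()   # any write may invalidate cached reads
        return self._catalog.save(kind, key, document, tags)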

Error Handling

  • Graceful Degradation: If one expert fails, others continue
  • Retry Logic: Transient errors trigger retries with backoff
  • Partial Results: Catalog stores partial findings if interrupted
  • Validation: Experts cross-validate each other's findings
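
A minimal retry-with-backoff sketch for the transient-error case; TransientError stands in for whatever exception the MCP client actually raises:

import random
import time

class TransientError(Exception):
    """Stand-in for the MCP client's retryable error type."""

def with_retries(fn, attempts=3, base_delay=0.5):
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise                  # out of retries: propagate
            # exponential backoff with jitter
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))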

Extensibility

  • Pluggable Experts: New expert types can be added easily
  • Domain-Specific Experts: Specialized experts for healthcare, finance, etc.
  • Custom Tools: Additional MCP tools for specific analysis needs
  • Expert Configuration: Experts can be configured/enabled based on needs
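
Pluggable experts could hang off a simple registry; this sketch (names hypothetical) shows how a domain-specific expert would slot in:

EXPERT_REGISTRY = {}

class Expert:
    """Hypothetical base class; subclasses self-register by name."""
    name = "base"

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        EXPERT_REGISTRY[cls.name] = cls

    def explore(self, catalog):
        raise NotImplementedError

class HealthcareExpert(Expert):
    name = "healthcare"

    def explore(self, catalog):
        ...  # e.g. classify PHI columns, detect clinical coding systems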

Usage Example

from discovery_agent import DiscoveryOrchestrator

# Initialize agent
agent = DiscoveryOrchestrator(
    mcp_endpoint="https://localhost:6071/mcp/query",
    auth_token="your_token"
)

# Run discovery
report = agent.discover(
    max_iterations=50,
    target_confidence=0.95
)

# Access findings
print(report["summary"])
print(report["domain"])
print(report["key_insights"])

# Query catalog for specific information
customers_analysis = agent.catalog.search("customers")
relationships = agent.catalog.get_kind("relationship")

Version History

  • 1.0 (2025-01-12) - Initial architecture design

Implementation Status

Status: Conceptual design - not implemented.

Actual Implementation: See Two_Phase_Discovery_Implementation.md for the actual ProxySQL MCP discovery implementation.

Version

  • Last Updated: 2026-01-19