Database Discovery Agent Architecture (Conceptual Design)
Overview
This document describes a conceptual architecture for an AI-powered database discovery agent that could autonomously explore, understand, and analyze any database schema regardless of complexity or domain. The agent would use a mixture-of-experts approach where specialized LLM agents collaborate to build comprehensive understanding of database structures, data patterns, and business semantics.
Note: This is a conceptual design document. The actual ProxySQL MCP implementation uses a different approach based on the two-phase discovery architecture described in Two_Phase_Discovery_Implementation.md.
Core Principles
- Domain Agnostic - No assumptions about what the database contains; everything is discovered
- Iterative Exploration - Not a one-time schema dump; continuous learning through multiple cycles
- Collaborative Intelligence - Multiple experts with different perspectives work together
- Hypothesis-Driven - Experts form hypotheses, test them, and refine understanding
- Confidence-Based - Exploration continues until a confidence threshold is reached
High-Level Architecture
┌─────────────────────────────────────────────────────────────────────────────────┐
│                               ORCHESTRATOR AGENT                                 │
│ - Manages exploration state                                                      │
│ - Coordinates expert agents                                                      │
│ - Synthesizes findings                                                           │
│ - Decides when exploration is complete                                           │
└─────────────────────────────────────────────────────────────────────────────────┘
                                         │
             ┌───────────────────────────┼───────────────────────────┐
             ▼                           ▼                           ▼
┌─────────────────────────┐ ┌─────────────────────────┐ ┌─────────────────────────┐
│    STRUCTURAL EXPERT    │ │   STATISTICAL EXPERT    │ │     SEMANTIC EXPERT     │
│                         │ │                         │ │                         │
│ - Schemas & tables      │ │ - Data distributions    │ │ - Business meaning      │
│ - Relationships         │ │ - Patterns & trends     │ │ - Domain concepts       │
│ - Constraints           │ │ - Outliers & anomalies  │ │ - Entity types          │
│ - Indexes & keys        │ │ - Correlations          │ │ - User intent           │
└─────────────────────────┘ └─────────────────────────┘ └─────────────────────────┘
             │                           │                           │
             └───────────────────────────┼───────────────────────────┘
                                         │
                                         ▼
                        ┌─────────────────────────────────┐
                        │         SHARED CATALOG          │
                        │         (SQLite + MCP)          │
                        │                                 │
                        │  Expert discoveries             │
                        │  Cross-expert notes             │
                        │  Exploration state              │
                        │  Hypotheses & results           │
                        └─────────────────────────────────┘
                                         │
                                         ▼
                        ┌─────────────────────────────────┐
                        │       MCP Query Endpoint        │
                        │  - Database access              │
                        │  - Catalog operations           │
                        │  - All tools available          │
                        └─────────────────────────────────┘
Expert Specializations
1. Structural Expert
Focus: Database topology and relationships
Responsibilities:
- Map all schemas, tables, and their relationships
- Identify primary keys, foreign keys, and constraints
- Analyze index patterns and access structures
- Detect table hierarchies and dependencies
- Identify structural patterns (star schema, snowflake, hierarchical, etc.)
Exploration Strategy:
from itertools import combinations

class StructuralExpert:
    def explore(self, catalog):
        # Iteration 1: Map the territory
        tables = self.list_all_tables()
        for table in tables:
            schema = self.get_table_schema(table)
            relationships = self.find_relationships(table)
            catalog.save("structure", f"table.{table}", {
                "columns": schema["columns"],
                "primary_key": schema["pk"],
                "foreign_keys": relationships,
                "indexes": schema["indexes"]
            })

        # Iteration 2: Find connection points between candidate table pairs
        for table_a, table_b in combinations(tables, 2):
            joins = self.suggest_joins(table_a, table_b)
            if joins:
                catalog.save("relationship", f"{table_a}↔{table_b}", joins)

        # Iteration 3: Identify structural patterns
        patterns = self.identify_patterns(catalog)
        # "This looks like a star schema", "Hierarchical structure", etc.
Output Examples:
- "Found 47 tables across 3 schemas"
- "customers table has 1:many relationship with orders via customer_id"
- "Detected star schema: fact_orders with dims: customers, products, time"
- "Table hierarchy: categories → subcategories → products"
2. Statistical Expert
Focus: Data characteristics and patterns
Responsibilities:
- Profile data distributions for all columns
- Identify correlations between fields
- Detect outliers and anomalies
- Find temporal patterns and trends
- Calculate data quality metrics
Exploration Strategy:
class StatisticalExpert:
    def explore(self, catalog):
        # Read structural discoveries first
        tables = catalog.get_kind("structure.*")
        for table in tables:
            # Profile each column
            for col in table["columns"]:
                stats = self.get_column_stats(table, col)
                catalog.save("statistics", f"{table}.{col}", {
                    "distinct_count": stats["distinct"],
                    "null_percentage": stats["null_pct"],
                    "distribution": stats["histogram"],
                    "top_values": stats["top_20"],
                    "numeric_range": stats["min_max"] if stats["is_numeric"] else None,
                    "anomalies": stats["outliers"]
                })

        # Find correlations
        correlations = self.find_correlations(tables)
        catalog.save("patterns", "correlations", correlations)
Output Examples:
- "orders.status has 4 values: pending (23%), confirmed (45%), shipped (28%), cancelled (4%)"
- "Strong correlation (0.87) between order_items.quantity and order_total"
- "Outlier detected: customer_age has values > 150 (likely data error)"
- "Temporal pattern: 80% of orders placed M-F, 9am-5pm"
3. Semantic Expert
Focus: Business meaning and domain understanding
Responsibilities:
- Infer business domain from data patterns
- Identify entity types and their roles
- Interpret relationships in business terms
- Understand user intent and use cases
- Document business rules and constraints
Exploration Strategy:
class SemanticExpert:
    def explore(self, catalog):
        # Synthesize findings from other experts
        structure = catalog.get_kind("structure.*")
        stats = catalog.get_kind("statistics.*")

        for table in structure:
            # Infer domain from table name, columns, and data
            domain = self.infer_domain(table, stats)
            # "This is an ecommerce database"

            # Understand entities
            entity_type = self.identify_entity(table)
            # "customers table = Customer entities"

            # Understand relationships
            for rel in catalog.get_relationships(table):
                business_rel = self.interpret_relationship(rel)
                # "customer has many orders"
                catalog.save("semantic", f"rel.{table}.{rel['related_table']}", {
                    "relationship": business_rel,
                    "cardinality": "one-to-many",
                    "business_rule": "A customer can place multiple orders"
                })

        # Identify business processes
        processes = self.infer_processes(structure, stats)
        # "Order fulfillment flow: orders → order_items → products"
        catalog.save("semantic", "processes", processes)
Output Examples:
- "Domain inference: E-commerce platform (B2C)"
- "Entity: customers represents individual shoppers, not businesses"
- "Business process: Order lifecycle = pending → confirmed → shipped → delivered"
- "Business rule: Customer cannot be deleted if they have active orders"
4. Query Expert
Focus: Efficient data access patterns
Responsibilities:
- Analyze query optimization opportunities
- Recommend index usage strategies
- Determine optimal join orders
- Design sampling strategies for exploration
- Identify performance bottlenecks
Exploration Strategy:
class QueryExpert:
    def explore(self, catalog):
        # Analyze query patterns from structural expert
        structure = catalog.get_kind("structure.*")
        for table in structure:
            # Suggest optimal access patterns
            access_patterns = self.analyze_access_patterns(table)
            catalog.save("query", f"access.{table}", {
                "best_index": access_patterns["optimal_index"],
                "join_order": access_patterns["optimal_join_order"],
                "sampling_strategy": access_patterns["sample_method"]
            })
Output Examples:
- "For customers table, use idx_email for lookups, idx_created_at for time ranges"
- "Join order: customers → orders → order_items (not reverse)"
- "Sample strategy: Use TABLESAMPLE for large tables, LIMIT 1000 for small"
Orchestrator: The Conductor
The Orchestrator agent coordinates all experts and manages the overall discovery process.
class DiscoveryOrchestrator:
    """Coordinates the collaborative discovery process"""

    def __init__(self, mcp_endpoint, auth_token=None):
        self.mcp = MCPClient(mcp_endpoint, auth_token)
        self.catalog = CatalogClient(self.mcp)
        self.experts = [
            StructuralExpert(self.catalog),
            StatisticalExpert(self.catalog),
            SemanticExpert(self.catalog),
            QueryExpert(self.catalog)
        ]
        self.state = {
            "iteration": 0,
            "phase": "initial",
            "confidence": 0.0,
            "coverage": 0.0,  # % of database explored
            "recent_insights": 0,       # new findings in the last iteration
            "diminishing_returns": 0,   # consecutive low-insight iterations
            "expert_contributions": {e.name: 0 for e in self.experts}
        }

    def discover(self, max_iterations=50, target_confidence=0.95):
        """Main discovery loop"""
        while self.state["iteration"] < max_iterations:
            self.state["iteration"] += 1

            # 1. ASSESS: What's the current state?
            assessment = self.assess_progress()

            # 2. PLAN: Which expert should work on what?
            tasks = self.plan_next_tasks(assessment)
            # Example: [
            #   {"expert": "structural", "task": "explore_orders_table", "priority": 0.8},
            #   {"expert": "semantic", "task": "interpret_customer_entity", "priority": 0.7},
            #   {"expert": "statistical", "task": "analyze_price_distribution", "priority": 0.6}
            # ]

            # 3. EXECUTE: Experts work in parallel
            results = self.execute_tasks_parallel(tasks)

            # 4. SYNTHESIZE: Combine findings
            synthesis = self.synthesize_findings(results)

            # 5. COLLABORATE: Experts share insights
            self.facilitate_collaboration(synthesis)

            # 6. REFLECT: Are we done?
            self.update_state(synthesis)
            stop, reason = self.should_stop()
            if stop:
                break

        # 7. FINALIZE: Create comprehensive understanding
        return self.create_final_report()

    def plan_next_tasks(self, assessment):
        """Decide what each expert should do next"""
        prompt = f"""
        You are orchestrating database discovery. Current state:
        {assessment}

        Expert findings:
        {self.format_expert_findings()}

        Plan the next exploration tasks. Consider:
        1. Which expert can contribute most valuable insights now?
        2. What areas need more exploration?
        3. Which expert findings should be verified or extended?

        Output JSON array of tasks, each with:
        - expert: which expert should do it
        - task: what they should do
        - priority: 0-1 (higher = more important)
        - dependencies: [array of catalog keys this depends on]
        """
        return self.llm_call(prompt)

    def facilitate_collaboration(self, synthesis):
        """Experts exchange notes and build on each other's work"""
        # Find points where experts should collaborate
        collaborations = self.find_collaboration_opportunities(synthesis)
        for collab in collaborations:
            # Example: Structural found relationship, Semantic should interpret it
            prompt = f"""
            EXPERT COLLABORATION:
            {collab['expert_a']} found: {collab['finding_a']}

            {collab['expert_b']}: Please interpret this finding from your perspective.
            Consider: How does this affect your understanding? What follow-up is needed?

            Catalog context: {self.get_relevant_context(collab)}
            """
            response = self.llm_call(prompt, expert=collab['expert_b'])
            self.catalog.save("collaboration", collab['id'], response)

    def create_final_report(self):
        """Synthesize all discoveries into comprehensive understanding"""
        prompt = f"""
        Create a comprehensive database understanding report from all expert findings.

        Include:
        1. Executive Summary
        2. Database Structure Overview
        3. Business Domain Analysis
        4. Key Insights & Patterns
        5. Data Quality Assessment
        6. Usage Recommendations

        Catalog data:
        {self.catalog.export_all()}
        """
        return self.llm_call(prompt)
Discovery Phases
Phase 1: Blind Exploration (Iterations 1-10)
Characteristics:
- All experts work independently on basic discovery
- No domain assumptions
- Systematic data collection
- Build foundational knowledge
Expert Activities:
- Structural: Map all tables, columns, relationships, constraints
- Statistical: Profile all columns, find distributions, cardinality
- Semantic: Identify entity types from naming patterns, infer basic domain
- Query: Analyze access patterns, identify indexes
Output:
- Complete table inventory
- Column profiles for all fields
- Basic relationship mapping
- Initial domain hypothesis
Phase 2: Pattern Recognition (Iterations 11-30)
Characteristics:
- Experts begin collaborating
- Patterns emerge from data
- Domain becomes clearer
- Hypotheses form
Expert Activities:
- Structural: Identifies structural patterns (star schema, hierarchies)
- Statistical: Finds correlations, temporal patterns, outliers
- Semantic: Interprets relationships in business terms
- Query: Optimizes based on discovered patterns
Example Collaboration:
Structural → Catalog: "Found customers→orders relationship (customer_id)"
Semantic reads: "This indicates customers place orders (ecommerce)"
Statistical reads: "Analyzing order patterns by customer..."
Query: "Optimizing customer-centric queries using customer_id index"
Output:
- Domain identification (e.g., "This is an ecommerce database")
- Business entity definitions
- Relationship interpretations
- Pattern documentation
Phase 3: Hypothesis-Driven Exploration (Iterations 31-45)
Characteristics:
- Experts form and test hypotheses
- Deep dives into specific areas
- Validation of assumptions
- Filling knowledge gaps
Example Hypotheses:
- "This is a SaaS metrics database" → Test for subscription patterns
- "There are seasonal trends in orders" → Analyze temporal distributions
- "Data quality issues in customer emails" → Validate email formats
- "Unused indexes exist" → Check index usage statistics
Expert Activities:
- All experts design experiments to test hypotheses
- Catalog stores hypothesis results (confirmed/refined/refuted)
- Collaboration to refine understanding based on evidence
Output:
- Validated business insights
- Refined domain understanding
- Data quality assessment
- Performance optimization recommendations
Phase 4: Synthesis & Validation (Iterations 46-50)
Characteristics:
- All experts collaborate to validate findings
- Resolve contradictions
- Fill remaining gaps
- Create unified understanding
Expert Activities:
- Cross-expert validation of key findings
- Synthesis of comprehensive understanding
- Documentation of uncertainties
- Recommendations for further analysis
Output:
- Final comprehensive report
- Confidence scores for each finding
- Remaining uncertainties
- Actionable recommendations
Domain-Agnostic Discovery Examples
Example 1: Law Firm Database
Iterations 1-5 (Blind Exploration):
Structural: "Found: cases, clients, attorneys, documents, time_entries, billing_rates"
Statistical: "time_entries has 1.2M rows, highly skewed distribution, 15% null values"
Semantic: "Entity types: Cases (legal matters), Clients (people/companies), Attorneys"
Query: "Best access path: case_id → time_entries (indexed)"
Iterations 6-15 (Pattern Recognition):
Collaboration:
Structural → Semantic: "cases have many-to-many with attorneys (case_attorneys table)"
Semantic: "Multiple attorneys per case = legal teams"
Statistical: "time_entries correlate with case_stage progression (r=0.72)"
Query: "Filter by case_date_first for time range queries (30% faster)"
Domain Inference:
Semantic: "Legal practice management system"
Structural: "Found invoices, payments tables - confirms practice management"
Statistical: "Billing patterns: hourly rates, contingency fees detected"
Iterations 16-30 (Hypothesis-Driven Exploration):
Hypothesis: "Firm specializes in specific case types"
→ Statistical: "Analyze case_type distribution"
→ Found: "70% personal_injury, 20% corporate_litigation, 10% family_law"
Hypothesis: "Document workflow exists"
→ Structural: "Found document_versions, approvals, court_filings tables"
→ Semantic: "Document approval workflow for court submissions"
Hypothesis: "Attorney productivity varies by case type"
→ Statistical: "Analyze time_entries per attorney per case_type"
→ Found: "Personal injury cases require 3.2x more attorney hours"
Iterations 31-40 (Synthesis & Validation):
Final Understanding:
"Mid-sized personal injury law firm (50-100 attorneys)
with practice management system including:
- Case management with document workflows
- Time tracking and billing (hourly + contingency)
- 70% focus on personal injury cases
- Average case duration: 18 months
- Key metrics: case duration, settlement amounts,
attorney productivity, document approval cycle time"
Example 2: Scientific Research Database
Iterations 1-5 (Blind Exploration):
Structural: "experiments, samples, measurements, researchers, publications, protocols"
Statistical: "High precision numeric data (10 decimal places), temporal patterns in experiments"
Semantic: "Research lab data management system"
Query: "Measurements table largest (45M rows), needs partitioning"
Iterations 6-15 (Pattern Recognition):
Domain: "Biology/medicine research (gene_sequences, drug_compounds detected)"
Patterns: "Experiments follow protocol → samples → measurements → analysis pipeline"
Structural: "Linear workflow: protocols → experiments → samples → measurements → analysis → publications"
Statistical: "High correlation between protocol_type and measurement_outcome"
Iterations 16-30 (Hypothesis-Driven Exploration):
Hypothesis: "Longitudinal study design"
→ Structural: "Found repeated_measurements, time_points tables"
→ Confirmed: "Same subjects measured over time"
Hypothesis: "Control groups present"
→ Statistical: "Found clustering in measurements (treatment vs control)"
→ Confirmed: "Experimental design includes control groups"
Hypothesis: "Statistical significance testing"
→ Statistical: "Found p_value distributions, confidence intervals in results"
→ Confirmed: "Clinical trial data with statistical validation"
Iterations 31-40 (Synthesis & Validation):
Final Understanding:
"Clinical trial data management system for pharmaceutical research
- Drug compound testing with control/treatment groups
- Longitudinal design (repeated measurements over time)
- Statistical validation pipeline
- Regulatory reporting (publication tracking)
- Sample tracking from collection to analysis"
Example 3: E-commerce Database
Iterations 1-5 (Blind Exploration):
Structural: "customers, orders, order_items, products, categories, inventory, reviews"
Statistical: "orders has 5.4M rows, steady growth trend, seasonal patterns"
Semantic: "Online retail platform"
Query: "orders table requires date-based partitioning"
Iterations 6-15 (Pattern Recognition):
Domain: "B2C ecommerce platform"
Relationships: "customers → orders (1:N), orders → order_items (1:N), order_items → products (N:1)"
Business flow: "Browse → Add to Cart → Checkout → Payment → Fulfillment"
Statistical: "Order value distribution: Long tail, $50 median, $280 mean"
Iterations 16-30 (Hypothesis-Driven Exploration):
Hypothesis: "Customer segments exist"
→ Statistical: "Cluster customers by order frequency, total spend, recency"
→ Found: "3 segments: Casual (70%), Regular (25%), VIP (5%)"
Hypothesis: "Product categories affect return rates"
→ Statistical: "analyze returns by category"
→ Found: "Clothing: 12% return rate, Electronics: 3% return rate"
Hypothesis: "Seasonal buying patterns"
→ Statistical: "Time series analysis of orders by month/day/week"
→ Found: "Peak: Nov-Dec (holidays), Dip: Jan, Slow: Feb-Mar"
Iterations 31-40 (Synthesis & Validation):
Final Understanding:
"Consumer ecommerce platform with:
- 5.4M orders, steady growth, strong seasonality
- 3 customer segments (Casual/Regular/VIP) with different behaviors
- 15% overall return rate (varies by category)
- Peak season: Nov-Dec (4.3x normal volume)
- Key metrics: conversion rate, AOV, customer lifetime value, return rate"
Catalog Schema
The catalog serves as shared memory for all experts. Key entry types:
Structure Entries
{
  "kind": "structure",
  "key": "table.customers",
  "document": {
    "columns": ["customer_id", "name", "email", "created_at"],
    "primary_key": "customer_id",
    "foreign_keys": [{"column": "region_id", "references": "regions(id)"}],
    "row_count": 125000
  },
  "tags": "customers,table"
}
Statistics Entries
{
  "kind": "statistics",
  "key": "customers.created_at",
  "document": {
    "distinct_count": 118500,
    "null_percentage": 0.0,
    "min": "2020-01-15",
    "max": "2025-01-10",
    "distribution": "uniform_growth"
  },
  "tags": "customers,created_at,temporal"
}
Semantic Entries
{
  "kind": "semantic",
  "key": "entity.customers",
  "document": {
    "entity_type": "Customer",
    "definition": "Individual shoppers who place orders",
    "business_role": "Revenue generator",
    "lifecycle": "Registered → Active → Inactive → Churned"
  },
  "tags": "semantic,entity,customers"
}
Relationship Entries
{
  "kind": "relationship",
  "key": "customers↔orders",
  "document": {
    "type": "one_to_many",
    "join_key": "customer_id",
    "business_meaning": "Customers place multiple orders",
    "cardinality_estimates": {
      "min_orders_per_customer": 1,
      "max_orders_per_customer": 247,
      "avg_orders_per_customer": 4.3
    }
  },
  "tags": "relationship,customers,orders"
}
Hypothesis Entries
{
  "kind": "hypothesis",
  "key": "vip_segment_behavior",
  "document": {
    "hypothesis": "VIP customers have higher order frequency and AOV",
    "status": "confirmed",
    "confidence": 0.92,
    "evidence": [
      "VIP avg 12.4 orders/year vs 2.1 for regular",
      "VIP avg AOV $156 vs $45 for regular"
    ]
  },
  "tags": "hypothesis,customer_segments,confirmed"
}
Collaboration Entries
{
  "kind": "collaboration",
  "key": "semantic_interpretation_001",
  "document": {
    "trigger": "Structural expert found orders.status enum",
    "expert": "semantic",
    "interpretation": "Order lifecycle: pending → confirmed → shipped → delivered",
    "follow_up_tasks": ["Analyze time_in_status durations", "Find bottleneck status"]
  },
  "tags": "collaboration,structural,semantic,order_lifecycle"
}
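The experts and orchestrator above assume a small catalog API (save, get_kind, search). Below is a minimal SQLite-backed sketch of that shared memory; the table layout and method names mirror the entry format above and are illustrative only, not the actual ProxySQL MCP catalog.

import json
import sqlite3


class CatalogClient:
    """Illustrative shared-memory catalog: one row per expert discovery."""

    def __init__(self, path="discovery_catalog.db"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS catalog ("
            " kind TEXT, key TEXT, document TEXT, tags TEXT,"
            " PRIMARY KEY (kind, key))"
        )

    def save(self, kind, key, document, tags=""):
        # Upsert so experts can refine earlier findings in later iterations.
        self.conn.execute(
            "INSERT OR REPLACE INTO catalog (kind, key, document, tags) VALUES (?, ?, ?, ?)",
            (kind, key, json.dumps(document), tags),
        )
        self.conn.commit()

    def get_kind(self, kind):
        # Accepts an exact kind ("relationship") or a prefix pattern ("structure.*").
        pattern = kind[:-2] + "%" if kind.endswith(".*") else kind
        rows = self.conn.execute(
            "SELECT key, document FROM catalog WHERE kind LIKE ?", (pattern,)
        ).fetchall()
        return {key: json.loads(doc) for key, doc in rows}

    def search(self, term):
        rows = self.conn.execute(
            "SELECT kind, key, document FROM catalog WHERE key LIKE ? OR tags LIKE ?",
            (f"%{term}%", f"%{term}%"),
        ).fetchall()
        return [(kind, key, json.loads(doc)) for kind, key, doc in rows]

An MCP-hosted catalog would expose the same operations as tools rather than direct SQLite calls.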
Stopping Criteria
The orchestrator evaluates whether to continue exploration based on:
- Confidence Threshold - Overall confidence in understanding exceeds target (e.g., 0.95)
- Coverage Threshold - Sufficient percentage of database explored (e.g., 95% of tables analyzed)
- Diminishing Returns - Last N iterations produced minimal new insights
- Resource Limits - Maximum iterations reached or time budget exceeded
- Expert Consensus - All experts indicate satisfactory understanding
def should_stop(self):
    # High confidence in core understanding
    if self.state["confidence"] >= 0.95:
        return True, "Confidence threshold reached"

    # Good coverage of database
    if self.state["coverage"] >= 0.95:
        return True, "Coverage threshold reached"

    # Diminishing returns: several consecutive low-insight iterations
    if self.state["recent_insights"] < 2:
        self.state["diminishing_returns"] += 1
        if self.state["diminishing_returns"] >= 3:
            return True, "Diminishing returns"
    else:
        self.state["diminishing_returns"] = 0

    # Expert consensus
    if all(expert.satisfied() for expert in self.experts):
        return True, "Expert consensus achieved"

    return False, "Continue exploration"
Implementation Considerations
Scalability
For large databases (hundreds/thousands of tables):
- Parallel Exploration: Experts work simultaneously on different table subsets (see the sketch after this list)
- Incremental Coverage: Prioritize important tables (many relationships, high cardinality)
- Smart Sampling: Use statistical sampling instead of full scans for large tables
- Progressive Refinement: Start with overview, drill down iteratively
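A sketch of how the orchestrator's execute_tasks_parallel step could fan planned tasks out to experts with a thread pool. It assumes each expert exposes a name attribute and a run_task(task) method; both are illustrative conventions, not part of the design above.

from concurrent.futures import ThreadPoolExecutor, as_completed


def execute_tasks_parallel(tasks, experts, max_workers=4):
    """Illustrative fan-out of planned tasks to expert agents."""
    by_name = {expert.name: expert for expert in experts}
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(by_name[task["expert"]].run_task, task): task
            for task in tasks
            if task["expert"] in by_name
        }
        for future in as_completed(futures):
            task = futures[future]
            try:
                results.append({"task": task, "result": future.result()})
            except Exception as exc:  # one failing expert must not stop the rest
                results.append({"task": task, "error": str(exc)})
    return results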
Performance
- Caching: Cache catalog queries to avoid repeated reads
- Batch Operations: Group multiple tool calls when possible
- Index-Aware: Let Query Expert guide exploration to use indexed columns
- Connection Pooling: Reuse database connections (already implemented in MCP)
Error Handling
- Graceful Degradation: If one expert fails, others continue
- Retry Logic: Transient errors trigger retries with backoff (see the sketch after this list)
- Partial Results: Catalog stores partial findings if interrupted
- Validation: Experts cross-validate each other's findings
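One possible shape for the retry logic, as a generic wrapper around any catalog or database call; the exception types that actually count as transient would depend on the client library in use.

import random
import time


def with_retries(operation, attempts=3, base_delay=0.5):
    """Illustrative retry wrapper for transient database/MCP errors."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise
            # Exponential backoff with a little jitter between attempts.
            time.sleep(base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1))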
Extensibility
- Pluggable Experts: New expert types can be added easily (see the sketch after this list)
- Domain-Specific Experts: Specialized experts for healthcare, finance, etc.
- Custom Tools: Additional MCP tools for specific analysis needs
- Expert Configuration: Experts can be configured/enabled based on needs
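One way the pluggable-expert idea could look: a shared base class plus a registry the orchestrator reads instead of a hard-coded expert list. The class names, registry, and the HealthcareExpert example are hypothetical.

class Expert:
    """Illustrative base class for pluggable experts."""

    name = "base"

    def __init__(self, catalog):
        self.catalog = catalog

    def run_task(self, task):
        raise NotImplementedError

    def satisfied(self):
        # Default: never blocks consensus; real experts would track their coverage.
        return True


EXPERT_REGISTRY = {}


def register_expert(cls):
    """Decorator that makes a new expert type available to the orchestrator."""
    EXPERT_REGISTRY[cls.name] = cls
    return cls


@register_expert
class HealthcareExpert(Expert):
    # Hypothetical domain-specific expert, enabled via configuration.
    name = "healthcare"

    def run_task(self, task):
        return {"note": f"domain-specific analysis for {task['task']}"}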
Usage Example
from discovery_agent import DiscoveryOrchestrator

# Initialize agent
agent = DiscoveryOrchestrator(
    mcp_endpoint="https://localhost:6071/mcp/query",
    auth_token="your_token"
)

# Run discovery
report = agent.discover(
    max_iterations=50,
    target_confidence=0.95
)

# Access findings
print(report["summary"])
print(report["domain"])
print(report["key_insights"])

# Query catalog for specific information
customers_analysis = agent.catalog.search("customers")
relationships = agent.catalog.get_kind("relationship")
Related Documentation
- Architecture.md - Overall MCP architecture
- README.md - Module overview and setup
- VARIABLES.md - Configuration variables reference
Version History
- 1.0 (2025-01-12) - Initial architecture design
Implementation Status
Status: Conceptual design - Not implemented
Actual Implementation: See Two_Phase_Discovery_Implementation.md for the actual ProxySQL MCP discovery implementation.
Version
- Last Updated: 2026-01-19