# Database Discovery Agent Architecture (Conceptual Design)

## Overview

This document describes a conceptual architecture for an AI-powered database discovery agent that could autonomously explore, understand, and analyze any database schema regardless of complexity or domain. The agent would use a mixture-of-experts approach in which specialized LLM agents collaborate to build a comprehensive understanding of database structures, data patterns, and business semantics.

**Note:** This is a conceptual design document. The actual ProxySQL MCP implementation uses a different approach based on the two-phase discovery architecture described in `Two_Phase_Discovery_Implementation.md`.

## Core Principles

1. **Domain Agnostic** - No assumptions about what the database contains; everything is discovered
2. **Iterative Exploration** - Not a one-time schema dump; continuous learning through multiple cycles
3. **Collaborative Intelligence** - Multiple experts with different perspectives work together
4. **Hypothesis-Driven** - Experts form hypotheses, test them, and refine understanding
5. **Confidence-Based** - Exploration continues until a confidence threshold is reached

## High-Level Architecture

```
               ┌─────────────────────────────────────────────┐
               │             ORCHESTRATOR AGENT              │
               │  - Manages exploration state                │
               │  - Coordinates expert agents                │
               │  - Synthesizes findings                     │
               │  - Decides when exploration is complete     │
               └─────────────────────────────────────────────┘
                                      │
            ┌─────────────────────────┼─────────────────────────┐
            ▼                         ▼                         ▼
┌─────────────────────────┐┌─────────────────────────┐┌─────────────────────────┐
│    STRUCTURAL EXPERT    ││   STATISTICAL EXPERT    ││     SEMANTIC EXPERT     │
│                         ││                         ││                         │
│ - Schemas & tables      ││ - Data distributions    ││ - Business meaning      │
│ - Relationships         ││ - Patterns & trends     ││ - Domain concepts       │
│ - Constraints           ││ - Outliers & anomalies  ││ - Entity types          │
│ - Indexes & keys        ││ - Correlations          ││ - User intent           │
└─────────────────────────┘└─────────────────────────┘└─────────────────────────┘
            │                         │                         │
            └─────────────────────────┼─────────────────────────┘
                                      ▼
                     ┌─────────────────────────────────┐
                     │         SHARED CATALOG          │
                     │         (SQLite + MCP)          │
                     │                                 │
                     │  Expert discoveries             │
                     │  Cross-expert notes             │
                     │  Exploration state              │
                     │  Hypotheses & results           │
                     └─────────────────────────────────┘
                                      │
                                      ▼
                     ┌─────────────────────────────────┐
                     │        MCP Query Endpoint       │
                     │  - Database access              │
                     │  - Catalog operations           │
                     │  - All tools available          │
                     └─────────────────────────────────┘
```

## Expert Specializations

### 1. Structural Expert

**Focus:** Database topology and relationships

**Responsibilities:**

- Map all schemas, tables, and their relationships
- Identify primary keys, foreign keys, and constraints
- Analyze index patterns and access structures
- Detect table hierarchies and dependencies
- Identify structural patterns (star schema, snowflake, hierarchical, etc.)
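As a concrete illustration of the kind of introspection these responsibilities imply, the sketch below maps tables, primary keys, and foreign keys. It uses SQLite's `PRAGMA` interface purely for illustration (the schema and the `map_structure` helper are hypothetical); against MySQL, the equivalent metadata would come from `information_schema` queries issued through the MCP query endpoint.

```python
import sqlite3

def map_structure(conn: sqlite3.Connection) -> dict:
    """Collect tables, columns, primary keys, and foreign keys.

    Illustrative only: a real Structural Expert would issue the
    equivalent metadata queries through the MCP query endpoint.
    """
    structure = {}
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
        # PRAGMA foreign_key_list rows: (id, seq, table, from, to, ...)
        fks = conn.execute(f"PRAGMA foreign_key_list({table})").fetchall()
        structure[table] = {
            "columns": [c[1] for c in cols],
            "primary_key": [c[1] for c in cols if c[5]],
            "foreign_keys": [
                {"column": fk[3], "references": f"{fk[2]}({fk[4]})"}
                for fk in fks
            ],
        }
    return structure

# Tiny hypothetical schema to exercise the walker
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id)
    );
""")
print(map_structure(conn))
```

The output mirrors the `"structure"` catalog entries shown later in this document, which is the point: each expert's exploration step reduces to metadata queries whose results are saved to the shared catalog.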
**Exploration Strategy:**

```python
class StructuralExpert:
    def explore(self, catalog):
        # Iteration 1: Map the territory
        tables = self.list_all_tables()
        for table in tables:
            schema = self.get_table_schema(table)
            relationships = self.find_relationships(table)
            catalog.save("structure", f"table.{table}", {
                "columns": schema["columns"],
                "primary_key": schema["pk"],
                "foreign_keys": relationships,
                "indexes": schema["indexes"],
            })

        # Iteration 2: Find connection points
        for table_a, table_b in self.candidate_pairs(tables):
            joins = self.suggest_joins(table_a, table_b)
            if joins:
                catalog.save("relationship", f"{table_a}↔{table_b}", joins)

        # Iteration 3: Identify structural patterns
        patterns = self.identify_patterns(catalog)
        # "This looks like a star schema", "Hierarchical structure", etc.
```

**Output Examples:**

- "Found 47 tables across 3 schemas"
- "customers table has 1:many relationship with orders via customer_id"
- "Detected star schema: fact_orders with dims: customers, products, time"
- "Table hierarchy: categories → subcategories → products"

### 2. Statistical Expert

**Focus:** Data characteristics and patterns

**Responsibilities:**

- Profile data distributions for all columns
- Identify correlations between fields
- Detect outliers and anomalies
- Find temporal patterns and trends
- Calculate data quality metrics

**Exploration Strategy:**

```python
class StatisticalExpert:
    def explore(self, catalog):
        # Read structural discoveries first
        tables = catalog.get_kind("table.*")
        for table in tables:
            # Profile each column
            for col in table["columns"]:
                stats = self.get_column_stats(table, col)
                catalog.save("statistics", f"{table}.{col}", {
                    "distinct_count": stats["distinct"],
                    "null_percentage": stats["null_pct"],
                    "distribution": stats["histogram"],
                    "top_values": stats["top_20"],
                    "numeric_range": stats["min_max"] if stats["is_numeric"] else None,
                    "anomalies": stats["outliers"],
                })

        # Find correlations
        correlations = self.find_correlations(tables)
        catalog.save("patterns", "correlations", correlations)
```

**Output Examples:**

- "orders.status has 4 values: pending (23%), confirmed (45%), shipped (28%), cancelled (4%)"
- "Strong correlation (0.87) between order_items.quantity and order_total"
- "Outlier detected: customer_age has values > 150 (likely data error)"
- "Temporal pattern: 80% of orders placed M-F, 9am-5pm"

### 3. Semantic Expert

**Focus:** Business meaning and domain understanding

**Responsibilities:**

- Infer business domain from data patterns
- Identify entity types and their roles
- Interpret relationships in business terms
- Understand user intent and use cases
- Document business rules and constraints

**Exploration Strategy:**

```python
class SemanticExpert:
    def explore(self, catalog):
        # Synthesize findings from other experts
        structure = catalog.get_kind("structure.*")
        stats = catalog.get_kind("statistics.*")

        for table in structure:
            # Infer domain from table name, columns, and data
            domain = self.infer_domain(table, stats)
            # "This is an ecommerce database"

            # Understand entities
            entity_type = self.identify_entity(table)
            # "customers table = Customer entities"

            # Understand relationships
            for rel in catalog.get_relationships(table):
                business_rel = self.interpret_relationship(rel)
                # "customer has many orders"
                catalog.save("semantic", f"rel.{table}.{rel['other_table']}", {
                    "relationship": business_rel,
                    "cardinality": "one-to-many",
                    "business_rule": "A customer can place multiple orders",
                })

        # Identify business processes
        processes = self.infer_processes(structure, stats)
        # "Order fulfillment flow: orders → order_items → products"
        catalog.save("semantic", "processes", processes)
```

**Output Examples:**

- "Domain inference: E-commerce platform (B2C)"
- "Entity: customers represents individual shoppers, not businesses"
- "Business process: Order lifecycle = pending → confirmed → shipped → delivered"
- "Business rule: Customer cannot be deleted if they have active orders"

### 4. Query Expert

**Focus:** Efficient data access patterns

**Responsibilities:**

- Analyze query optimization opportunities
- Recommend index usage strategies
- Determine optimal join orders
- Design sampling strategies for exploration
- Identify performance bottlenecks

**Exploration Strategy:**

```python
class QueryExpert:
    def explore(self, catalog):
        # Analyze query patterns from structural expert
        structure = catalog.get_kind("structure.*")
        for table in structure:
            # Suggest optimal access patterns
            access_patterns = self.analyze_access_patterns(table)
            catalog.save("query", f"access.{table}", {
                "best_index": access_patterns["optimal_index"],
                "join_order": access_patterns["optimal_join_order"],
                "sampling_strategy": access_patterns["sample_method"],
            })
```

**Output Examples:**

- "For customers table, use idx_email for lookups, idx_created_at for time ranges"
- "Join order: customers → orders → order_items (not reverse)"
- "Sample strategy: Use TABLESAMPLE for large tables, LIMIT 1000 for small"

## Orchestrator: The Conductor

The Orchestrator agent coordinates all experts and manages the overall discovery process.

```python
class DiscoveryOrchestrator:
    """Coordinates the collaborative discovery process"""

    def __init__(self, mcp_endpoint):
        self.mcp = MCPClient(mcp_endpoint)
        self.catalog = CatalogClient(self.mcp)
        self.experts = [
            StructuralExpert(self.catalog),
            StatisticalExpert(self.catalog),
            SemanticExpert(self.catalog),
            QueryExpert(self.catalog),
        ]
        self.state = {
            "iteration": 0,
            "phase": "initial",
            "confidence": 0.0,
            "coverage": 0.0,  # % of database explored
            "recent_insights": 0,
            "diminishing_returns": 0,
            "expert_contributions": {e.name: 0 for e in self.experts},
        }

    def discover(self, max_iterations=50, target_confidence=0.95):
        """Main discovery loop"""
        while self.state["iteration"] < max_iterations:
            self.state["iteration"] += 1

            # 1. ASSESS: What's the current state?
            assessment = self.assess_progress()

            # 2. PLAN: Which expert should work on what?
            tasks = self.plan_next_tasks(assessment)
            # Example: [
            #   {"expert": "structural", "task": "explore_orders_table", "priority": 0.8},
            #   {"expert": "semantic", "task": "interpret_customer_entity", "priority": 0.7},
            #   {"expert": "statistical", "task": "analyze_price_distribution", "priority": 0.6}
            # ]

            # 3. EXECUTE: Experts work in parallel
            results = self.execute_tasks_parallel(tasks)

            # 4. SYNTHESIZE: Combine findings
            synthesis = self.synthesize_findings(results)

            # 5. COLLABORATE: Experts share insights
            self.facilitate_collaboration(synthesis)

            # 6. REFLECT: Are we done?
            self.update_state(synthesis)
            stop, reason = self.should_stop()
            if stop:
                break

        # 7. FINALIZE: Create comprehensive understanding
        return self.create_final_report()

    def plan_next_tasks(self, assessment):
        """Decide what each expert should do next"""
        prompt = f"""
        You are orchestrating database discovery. Current state:
        {assessment}

        Expert findings:
        {self.format_expert_findings()}

        Plan the next exploration tasks. Consider:
        1. Which expert can contribute most valuable insights now?
        2. What areas need more exploration?
        3. Which expert findings should be verified or extended?

        Output JSON array of tasks, each with:
        - expert: which expert should do it
        - task: what they should do
        - priority: 0-1 (higher = more important)
        - dependencies: [array of catalog keys this depends on]
        """
        return self.llm_call(prompt)

    def facilitate_collaboration(self, synthesis):
        """Experts exchange notes and build on each other's work"""
        # Find points where experts should collaborate
        collaborations = self.find_collaboration_opportunities(synthesis)

        for collab in collaborations:
            # Example: Structural found relationship, Semantic should interpret it
            prompt = f"""
            EXPERT COLLABORATION:

            {collab['expert_a']} found: {collab['finding_a']}

            {collab['expert_b']}: Please interpret this finding from your perspective.
            Consider: How does this affect your understanding? What follow-up is needed?

            Catalog context:
            {self.get_relevant_context(collab)}
            """
            response = self.llm_call(prompt, expert=collab['expert_b'])
            self.catalog.save("collaboration", collab['id'], response)

    def create_final_report(self):
        """Synthesize all discoveries into comprehensive understanding"""
        prompt = f"""
        Create a comprehensive database understanding report from all expert findings.

        Include:
        1. Executive Summary
        2. Database Structure Overview
        3. Business Domain Analysis
        4. Key Insights & Patterns
        5. Data Quality Assessment
        6. Usage Recommendations

        Catalog data:
        {self.catalog.export_all()}
        """
        return self.llm_call(prompt)
```

## Discovery Phases

### Phase 1: Blind Exploration (Iterations 1-10)

**Characteristics:**

- All experts work independently on basic discovery
- No domain assumptions
- Systematic data collection
- Build foundational knowledge

**Expert Activities:**

- **Structural**: Map all tables, columns, relationships, constraints
- **Statistical**: Profile all columns, find distributions, cardinality
- **Semantic**: Identify entity types from naming patterns, infer basic domain
- **Query**: Analyze access patterns, identify indexes

**Output:**

- Complete table inventory
- Column profiles for all fields
- Basic relationship mapping
- Initial domain hypothesis

### Phase 2: Pattern Recognition (Iterations 11-30)

**Characteristics:**

- Experts begin collaborating
- Patterns emerge from data
- Domain becomes clearer
- Hypotheses form

**Expert Activities:**

- **Structural**: Identifies structural patterns (star schema, hierarchies)
- **Statistical**: Finds correlations, temporal patterns, outliers
- **Semantic**: Interprets relationships in business terms
- **Query**: Optimizes based on discovered patterns

**Example Collaboration:**

```
Structural → Catalog: "Found customers→orders relationship (customer_id)"
Semantic reads:       "This indicates customers place orders (ecommerce)"
Statistical reads:    "Analyzing order patterns by customer..."
Query:                "Optimizing customer-centric queries using customer_id index"
```

**Output:**

- Domain identification (e.g., "This is an ecommerce database")
- Business entity definitions
- Relationship interpretations
- Pattern documentation

### Phase 3: Hypothesis-Driven Exploration (Iterations 31-45)

**Characteristics:**

- Experts form and test hypotheses
- Deep dives into specific areas
- Validation of assumptions
- Filling knowledge gaps

**Example Hypotheses:**

- "This is a SaaS metrics database" → Test for subscription patterns
- "There are seasonal trends in orders" → Analyze temporal distributions
- "Data quality issues in customer emails" → Validate email formats
- "Unused indexes exist" → Check index usage statistics

**Expert Activities:**

- All experts design experiments to test hypotheses
- Catalog stores hypothesis results (confirmed/refined/refuted)
- Collaboration to refine understanding based on evidence

**Output:**

- Validated business insights
- Refined domain understanding
- Data quality assessment
- Performance optimization recommendations

### Phase 4: Synthesis & Validation (Iterations 46-50)

**Characteristics:**

- All experts collaborate to validate findings
- Resolve contradictions
- Fill remaining gaps
- Create unified understanding

**Expert Activities:**

- Cross-expert validation of key findings
- Synthesis of comprehensive understanding
- Documentation of uncertainties
- Recommendations for further analysis

**Output:**

- Final comprehensive report
- Confidence scores for each finding
- Remaining uncertainties
- Actionable recommendations

## Domain-Agnostic Discovery Examples

### Example 1: Law Firm Database

**Phase 1 (Blind):**

```
Structural: "Found: cases, clients, attorneys, documents, time_entries, billing_rates"
Statistical: "time_entries has 1.2M rows, highly skewed distribution, 15% null values"
Semantic: "Entity types: Cases (legal matters), Clients (people/companies), Attorneys"
Query: "Best access path: case_id → time_entries (indexed)"
```

**Phase 2 (Patterns):**

```
Collaboration:
Structural → Semantic: "cases have many-to-many with attorneys (case_attorneys table)"
Semantic: "Multiple attorneys per case = legal teams"
Statistical: "time_entries correlate with case_stage progression (r=0.72)"
Query: "Filter by case_date_first for time range queries (30% faster)"

Domain Inference:
Semantic: "Legal practice management system"
Structural: "Found invoices, payments tables - confirms practice management"
Statistical: "Billing patterns: hourly rates, contingency fees detected"
```

**Phase 3 (Hypotheses):**

```
Hypothesis: "Firm specializes in specific case types"
→ Statistical: "Analyze case_type distribution"
→ Found: "70% personal_injury, 20% corporate_litigation, 10% family_law"

Hypothesis: "Document workflow exists"
→ Structural: "Found document_versions, approvals, court_filings tables"
→ Semantic: "Document approval workflow for court submissions"

Hypothesis: "Attorney productivity varies by case type"
→ Statistical: "Analyze time_entries per attorney per case_type"
→ Found: "Personal injury cases require 3.2x more attorney hours"
```

**Phase 4 (Synthesis):**

```
Final Understanding:
"Mid-sized personal injury law firm (50-100 attorneys) with practice
management system including:
- Case management with document workflows
- Time tracking and billing (hourly + contingency)
- 70% focus on personal injury cases
- Average case duration: 18 months
- Key metrics: case duration, settlement amounts, attorney productivity,
  document approval cycle time"
```

### Example 2: Scientific Research Database

**Phase 1 (Blind):**

```
Structural: "experiments, samples, measurements, researchers, publications, protocols"
Statistical: "High precision numeric data (10 decimal places), temporal patterns in experiments"
Semantic: "Research lab data management system"
Query: "Measurements table largest (45M rows), needs partitioning"
```

**Phase 2 (Patterns):**

```
Domain: "Biology/medicine research (gene_sequences, drug_compounds detected)"
Patterns: "Experiments follow protocol → samples → measurements → analysis pipeline"
Structural: "Linear workflow: protocols → experiments → samples → measurements → analysis → publications"
Statistical: "High correlation between protocol_type and measurement_outcome"
```

**Phase 3 (Hypotheses):**

```
Hypothesis: "Longitudinal study design"
→ Structural: "Found repeated_measurements, time_points tables"
→ Confirmed: "Same subjects measured over time"

Hypothesis: "Control groups present"
→ Statistical: "Found clustering in measurements (treatment vs control)"
→ Confirmed: "Experimental design includes control groups"

Hypothesis: "Statistical significance testing"
→ Statistical: "Found p_value distributions, confidence intervals in results"
→ Confirmed: "Clinical trial data with statistical validation"
```

**Phase 4 (Synthesis):**

```
Final Understanding:
"Clinical trial data management system for pharmaceutical research
- Drug compound testing with control/treatment groups
- Longitudinal design (repeated measurements over time)
- Statistical validation pipeline
- Regulatory reporting (publication tracking)
- Sample tracking from collection to analysis"
```

### Example 3: E-commerce Database

**Phase 1 (Blind):**

```
Structural: "customers, orders, order_items, products, categories, inventory, reviews"
Statistical: "orders has 5.4M rows, steady growth trend, seasonal patterns"
Semantic: "Online retail platform"
Query: "orders table requires date-based partitioning"
```

**Phase 2 (Patterns):**

```
Domain: "B2C ecommerce platform"
Relationships: "customers → orders (1:N), orders → order_items (1:N), order_items → products (N:1)"
Business flow: "Browse → Add to Cart → Checkout → Payment → Fulfillment"
Statistical: "Order value distribution: Long tail, $50 median, $280 mean"
```

**Phase 3 (Hypotheses):**

```
Hypothesis: "Customer segments exist"
→ Statistical: "Cluster customers by order frequency, total spend, recency"
→ Found: "3 segments: Casual (70%), Regular (25%), VIP (5%)"

Hypothesis: "Product categories affect return rates"
→ Statistical: "Analyze returns by category"
→ Found: "Clothing: 12% return rate, Electronics: 3% return rate"

Hypothesis: "Seasonal buying patterns"
→ Statistical: "Time series analysis of orders by month/day/week"
→ Found: "Peak: Nov-Dec (holidays), Dip: Jan, Slow: Feb-Mar"
```

**Phase 4 (Synthesis):**

```
Final Understanding:
"Consumer ecommerce platform with:
- 5.4M orders, steady growth, strong seasonality
- 3 customer segments (Casual/Regular/VIP) with different behaviors
- 15% overall return rate (varies by category)
- Peak season: Nov-Dec (4.3x normal volume)
- Key metrics: conversion rate, AOV, customer lifetime value, return rate"
```

## Catalog Schema

The catalog serves as shared memory for all experts. Key entry types:

### Structure Entries

```json
{
  "kind": "structure",
  "key": "table.customers",
  "document": {
    "columns": ["customer_id", "name", "email", "created_at"],
    "primary_key": "customer_id",
    "foreign_keys": [{"column": "region_id", "references": "regions(id)"}],
    "row_count": 125000
  },
  "tags": "customers,table"
}
```

### Statistics Entries

```json
{
  "kind": "statistics",
  "key": "customers.created_at",
  "document": {
    "distinct_count": 118500,
    "null_percentage": 0.0,
    "min": "2020-01-15",
    "max": "2025-01-10",
    "distribution": "uniform_growth"
  },
  "tags": "customers,created_at,temporal"
}
```

### Semantic Entries

```json
{
  "kind": "semantic",
  "key": "entity.customers",
  "document": {
    "entity_type": "Customer",
    "definition": "Individual shoppers who place orders",
    "business_role": "Revenue generator",
    "lifecycle": "Registered → Active → Inactive → Churned"
  },
  "tags": "semantic,entity,customers"
}
```

### Relationship Entries

```json
{
  "kind": "relationship",
  "key": "customers↔orders",
  "document": {
    "type": "one_to_many",
    "join_key": "customer_id",
    "business_meaning": "Customers place multiple orders",
    "cardinality_estimates": {
      "min_orders_per_customer": 1,
      "max_orders_per_customer": 247,
      "avg_orders_per_customer": 4.3
    }
  },
  "tags": "relationship,customers,orders"
}
```

### Hypothesis Entries

```json
{
  "kind": "hypothesis",
  "key": "vip_segment_behavior",
  "document": {
    "hypothesis": "VIP customers have higher order frequency and AOV",
    "status": "confirmed",
    "confidence": 0.92,
    "evidence": [
      "VIP avg 12.4 orders/year vs 2.1 for regular",
      "VIP avg AOV $156 vs $45 for regular"
    ]
  },
  "tags": "hypothesis,customer_segments,confirmed"
}
```

### Collaboration Entries

```json
{
  "kind": "collaboration",
  "key": "semantic_interpretation_001",
  "document": {
    "trigger": "Structural expert found orders.status enum",
    "expert": "semantic",
    "interpretation": "Order lifecycle: pending → confirmed → shipped → delivered",
    "follow_up_tasks": ["Analyze time_in_status durations", "Find bottleneck status"]
  },
  "tags": "collaboration,structural,semantic,order_lifecycle"
}
```

## Stopping Criteria

The orchestrator evaluates whether to continue exploration based on:

1. **Confidence Threshold** - Overall confidence in understanding exceeds target (e.g., 0.95)
2. **Coverage Threshold** - Sufficient percentage of database explored (e.g., 95% of tables analyzed)
3. **Diminishing Returns** - Last N iterations produced minimal new insights
4. **Resource Limits** - Maximum iterations reached or time budget exceeded
5. **Expert Consensus** - All experts indicate satisfactory understanding

```python
def should_stop(self):
    # High confidence in core understanding
    if self.state["confidence"] >= 0.95:
        return True, "Confidence threshold reached"

    # Good coverage of database
    if self.state["coverage"] >= 0.95:
        return True, "Coverage threshold reached"

    # Diminishing returns
    if self.state["recent_insights"] < 2:
        self.state["diminishing_returns"] += 1
        if self.state["diminishing_returns"] >= 3:
            return True, "Diminishing returns"

    # Expert consensus
    if all(expert.satisfied() for expert in self.experts):
        return True, "Expert consensus achieved"

    return False, "Continue exploration"
```

## Implementation Considerations

### Scalability

For large databases (hundreds or thousands of tables):

- **Parallel Exploration**: Experts work simultaneously on different table subsets
- **Incremental Coverage**: Prioritize important tables (many relationships, high cardinality)
- **Smart Sampling**: Use statistical sampling instead of full scans for large tables
- **Progressive Refinement**: Start with an overview, drill down iteratively

### Performance

- **Caching**: Cache catalog queries to avoid repeated reads
- **Batch Operations**: Group multiple tool calls when possible
- **Index-Aware**: Let the Query Expert guide exploration to use indexed columns
- **Connection Pooling**: Reuse database connections (already implemented in MCP)

### Error Handling

- **Graceful Degradation**: If one expert fails, others continue
- **Retry Logic**: Transient errors trigger retries with backoff
- **Partial Results**: Catalog stores partial findings if interrupted
- **Validation**: Experts cross-validate each other's findings

### Extensibility

- **Pluggable Experts**: New expert types can be added easily
- **Domain-Specific Experts**: Specialized experts for healthcare, finance, etc.
- **Custom Tools**: Additional MCP tools for specific analysis needs
- **Expert Configuration**: Experts can be configured/enabled based on needs

## Usage Example

```python
from discovery_agent import DiscoveryOrchestrator

# Initialize agent
agent = DiscoveryOrchestrator(
    mcp_endpoint="https://localhost:6071/mcp/query",
    auth_token="your_token"
)

# Run discovery
report = agent.discover(
    max_iterations=50,
    target_confidence=0.95
)

# Access findings
print(report["summary"])
print(report["domain"])
print(report["key_insights"])

# Query catalog for specific information
customers_analysis = agent.catalog.search("customers")
relationships = agent.catalog.get_kind("relationship")
```

## Related Documentation

- [Architecture.md](Architecture.md) - Overall MCP architecture
- [README.md](README.md) - Module overview and setup
- [VARIABLES.md](VARIABLES.md) - Configuration variables reference

## Version History

- **1.0** (2025-01-12) - Initial architecture design

## Implementation Status

**Status:** Conceptual design - Not implemented

**Actual Implementation:** See `Two_Phase_Discovery_Implementation.md` for the actual ProxySQL MCP discovery implementation.

## Version

- **Last Updated:** 2026-01-19
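## Appendix: Illustrative Catalog Sketch

The catalog operations used throughout this document (`save`, `get_kind`, `search`) are never defined, so as a closing illustration, here is a minimal SQLite-backed sketch matching the "SQLite + MCP" shared catalog in the architecture diagram. This is a hedged sketch under invented names, not the actual implementation: the `SharedCatalog` class, its table layout, and its simplified exact-match `get_kind` (the wildcard patterns like `"structure.*"` are omitted) are all hypothetical.

```python
import json
import sqlite3

class SharedCatalog:
    """Minimal sketch of the shared catalog experts write to.

    Hypothetical: the real catalog would sit behind the MCP endpoint.
    """

    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS entries ("
            "kind TEXT, key TEXT, document TEXT, tags TEXT, "
            "PRIMARY KEY (kind, key))"
        )

    def save(self, kind: str, key: str, document: dict, tags: str = "") -> None:
        # Upsert keyed on (kind, key), mirroring the catalog entry schema
        self.db.execute(
            "INSERT OR REPLACE INTO entries VALUES (?, ?, ?, ?)",
            (kind, key, json.dumps(document), tags),
        )

    def get_kind(self, kind: str) -> list:
        rows = self.db.execute(
            "SELECT document FROM entries WHERE kind = ?", (kind,))
        return [json.loads(r[0]) for r in rows]

    def search(self, term: str) -> list:
        rows = self.db.execute(
            "SELECT kind, key FROM entries WHERE key LIKE ? OR tags LIKE ?",
            (f"%{term}%", f"%{term}%"),
        )
        return rows.fetchall()

catalog = SharedCatalog()
catalog.save("structure", "table.customers",
             {"columns": ["customer_id", "name"]}, tags="customers,table")
print(catalog.search("customers"))  # → [('structure', 'table.customers')]
```

Keeping the store this small is deliberate: the design's collaboration model only requires that experts can persist JSON documents under a `(kind, key)` address and retrieve each other's findings by kind or keyword.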