diff --git a/doc/MCP/Database_Discovery_Agent.md b/doc/MCP/Database_Discovery_Agent.md new file mode 100644 index 000000000..58eaf01f0 --- /dev/null +++ b/doc/MCP/Database_Discovery_Agent.md @@ -0,0 +1,800 @@ +# Database Discovery Agent Architecture + +## Overview + +This document describes the architecture for an AI-powered database discovery agent that can autonomously explore, understand, and analyze any database schema regardless of complexity or domain. The agent uses a mixture-of-experts approach where specialized LLM agents collaborate to build comprehensive understanding of database structures, data patterns, and business semantics. + +## Core Principles + +1. **Domain Agnostic** - No assumptions about what the database contains; everything is discovered +2. **Iterative Exploration** - Not a one-time schema dump; continuous learning through multiple cycles +3. **Collaborative Intelligence** - Multiple experts with different perspectives work together +4. **Hypothesis-Driven** - Experts form hypotheses, test them, and refine understanding +5. 
**Confidence-Based** - Exploration continues until a confidence threshold is reached + +## High-Level Architecture + +``` +┌─────────────────────────────────────────────────────────────────────┐ +│ ORCHESTRATOR AGENT │ +│ - Manages exploration state │ +│ - Coordinates expert agents │ +│ - Synthesizes findings │ +│ - Decides when exploration is complete │ +└─────────────────────────────────────────────────────────────────────┘ + │ + ├─────────────────────────────────────┐ + │ │ + ▼─────────────────▼ ▼─────────────────▼ + ┌─────────────────────────┐ ┌─────────────────────────┐ ┌─────────────────────────┐ + │ STRUCTURAL EXPERT │ │ STATISTICAL EXPERT │ │ SEMANTIC EXPERT │ + │ │ │ │ │ │ + │ - Schemas & tables │ │ - Data distributions │ │ - Business meaning │ + │ - Relationships │ │ - Patterns & trends │ │ - Domain concepts │ + │ - Constraints │ │ - Outliers & anomalies │ │ - Entity types │ + │ - Indexes & keys │ │ - Correlations │ │ - User intent │ + └─────────────────────────┘ └─────────────────────────┘ └─────────────────────────┘ + │ │ │ + └───────────────────────────┼───────────────────────────┘ + │ + ▼ + ┌─────────────────────────────────┐ + │ SHARED CATALOG │ + │ (SQLite + MCP) │ + │ │ + │ Expert discoveries │ + │ Cross-expert notes │ + │ Exploration state │ + │ Hypotheses & results │ + └─────────────────────────────────┘ + │ + ▼ + ┌─────────────────────────────────┐ + │ MCP Query Endpoint │ + │ - Database access │ + │ - Catalog operations │ + │ - All tools available │ + └─────────────────────────────────┘ +``` + +## Expert Specializations + +### 1. Structural Expert + +**Focus:** Database topology and relationships + +**Responsibilities:** +- Map all schemas, tables, and their relationships +- Identify primary keys, foreign keys, and constraints +- Analyze index patterns and access structures +- Detect table hierarchies and dependencies +- Identify structural patterns (star schema, snowflake, hierarchical, etc.) 
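
As a concrete illustration of relationship mapping, here is a minimal sketch that groups foreign-key rows into a per-table map. The row shape (`table, column, ref_table, ref_column`) is an assumption, e.g. what a query against a PostgreSQL-style `information_schema` might return; the function name is hypothetical and not part of the MCP toolset:

```python
def build_fk_map(fk_rows):
    """Group foreign-key rows into a per-table relationship map.

    `fk_rows` is assumed to be (table, column, ref_table, ref_column)
    tuples, e.g. from a PostgreSQL-style information_schema query;
    adapt the source query for other engines.
    """
    fk_map = {}
    for table, column, ref_table, ref_column in fk_rows:
        fk_map.setdefault(table, []).append({
            "column": column,
            "references": f"{ref_table}({ref_column})",
        })
    return fk_map


rows = [
    ("orders", "customer_id", "customers", "customer_id"),
    ("order_items", "order_id", "orders", "order_id"),
]
print(build_fk_map(rows)["orders"])
# → [{'column': 'customer_id', 'references': 'customers(customer_id)'}]
```

Each per-table entry can then be written to the shared catalog as a `structure` discovery, in the shape shown in the Catalog Schema section.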

**Exploration Strategy:**
```python
from itertools import combinations

class StructuralExpert:
    def explore(self, catalog):
        # Iteration 1: Map the territory
        tables = self.list_all_tables()
        for table in tables:
            schema = self.get_table_schema(table)
            relationships = self.find_relationships(table)

            catalog.save("structure", f"table.{table}", {
                "columns": schema["columns"],
                "primary_key": schema["pk"],
                "foreign_keys": relationships,
                "indexes": schema["indexes"]
            })

        # Iteration 2: Find connection points between every table pair
        for table_a, table_b in combinations(tables, 2):
            joins = self.suggest_joins(table_a, table_b)
            if joins:
                catalog.save("relationship", f"{table_a}↔{table_b}", joins)

        # Iteration 3: Identify structural patterns
        patterns = self.identify_patterns(catalog)
        # "This looks like a star schema", "Hierarchical structure", etc.
```

**Output Examples:**
- "Found 47 tables across 3 schemas"
- "customers table has 1:many relationship with orders via customer_id"
- "Detected star schema: fact_orders with dims: customers, products, time"
- "Table hierarchy: categories → subcategories → products"

### 2. 
Statistical Expert

**Focus:** Data characteristics and patterns

**Responsibilities:**
- Profile data distributions for all columns
- Identify correlations between fields
- Detect outliers and anomalies
- Find temporal patterns and trends
- Calculate data quality metrics

**Exploration Strategy:**
```python
class StatisticalExpert:
    def explore(self, catalog):
        # Read structural discoveries first
        tables = catalog.get_kind("structure.*")

        for table in tables:
            # Profile each column
            for col in table["columns"]:
                stats = self.get_column_stats(table, col)

                catalog.save("statistics", f"{table}.{col}", {
                    "distinct_count": stats["distinct"],
                    "null_percentage": stats["null_pct"],
                    "distribution": stats["histogram"],
                    "top_values": stats["top_20"],
                    "numeric_range": stats["min_max"] if stats["is_numeric"] else None,
                    "anomalies": stats["outliers"]
                })

        # Find correlations across tables
        correlations = self.find_correlations(tables)
        catalog.save("patterns", "correlations", correlations)
```

**Output Examples:**
- "orders.status has 4 values: pending (23%), confirmed (45%), shipped (28%), cancelled (4%)"
- "Strong correlation (0.87) between order_items.quantity and order_total"
- "Outlier detected: customer_age has values > 150 (likely data error)"
- "Temporal pattern: 80% of orders placed M-F, 9am-5pm"

### 3. 
Semantic Expert

**Focus:** Business meaning and domain understanding

**Responsibilities:**
- Infer business domain from data patterns
- Identify entity types and their roles
- Interpret relationships in business terms
- Understand user intent and use cases
- Document business rules and constraints

**Exploration Strategy:**
```python
class SemanticExpert:
    def explore(self, catalog):
        # Synthesize findings from other experts
        structure = catalog.get_kind("structure.*")
        stats = catalog.get_kind("statistics.*")

        for table in structure:
            # Infer domain from table name, columns, and data
            domain = self.infer_domain(table, stats)
            # "This is an ecommerce database"

            # Understand entities
            entity_type = self.identify_entity(table)
            # "customers table = Customer entities"

            # Understand relationships
            for rel in catalog.get_relationships(table):
                other = rel["related_table"]  # table on the far side
                business_rel = self.interpret_relationship(rel)
                # "customer has many orders"
                catalog.save("semantic", f"rel.{table}.{other}", {
                    "relationship": business_rel,
                    "cardinality": "one-to-many",
                    "business_rule": "A customer can place multiple orders"
                })

        # Identify business processes
        processes = self.infer_processes(structure, stats)
        # "Order fulfillment flow: orders → order_items → products"
        catalog.save("semantic", "processes", processes)
```

**Output Examples:**
- "Domain inference: E-commerce platform (B2C)"
- "Entity: customers represents individual shoppers, not businesses"
- "Business process: Order lifecycle = pending → confirmed → shipped → delivered"
- "Business rule: Customer cannot be deleted if they have active orders"

### 4. 
Query Expert

**Focus:** Efficient data access patterns

**Responsibilities:**
- Analyze query optimization opportunities
- Recommend index usage strategies
- Determine optimal join orders
- Design sampling strategies for exploration
- Identify performance bottlenecks

**Exploration Strategy:**
```python
class QueryExpert:
    def explore(self, catalog):
        # Analyze query patterns from structural expert
        structure = catalog.get_kind("structure.*")

        for table in structure:
            # Suggest optimal access patterns
            access_patterns = self.analyze_access_patterns(table)
            catalog.save("query", f"access.{table}", {
                "best_index": access_patterns["optimal_index"],
                "join_order": access_patterns["optimal_join_order"],
                "sampling_strategy": access_patterns["sample_method"]
            })
```

**Output Examples:**
- "For customers table, use idx_email for lookups, idx_created_at for time ranges"
- "Join order: customers → orders → order_items (not reverse)"
- "Sample strategy: Use TABLESAMPLE for large tables, LIMIT 1000 for small"

## Orchestrator: The Conductor

The Orchestrator agent coordinates all experts and manages the overall discovery process.

```python
class DiscoveryOrchestrator:
    """Coordinates the collaborative discovery process"""

    def __init__(self, mcp_endpoint, auth_token=None):
        self.mcp = MCPClient(mcp_endpoint, auth_token=auth_token)
        self.catalog = CatalogClient(self.mcp)

        self.experts = [
            StructuralExpert(self.catalog),
            StatisticalExpert(self.catalog),
            SemanticExpert(self.catalog),
            QueryExpert(self.catalog)
        ]

        self.state = {
            "iteration": 0,
            "phase": "initial",
            "confidence": 0.0,
            "coverage": 0.0,  # % of database explored
            "recent_insights": 0,      # tracked for should_stop()
            "diminishing_returns": 0,  # tracked for should_stop()
            "expert_contributions": {e.name: 0 for e in self.experts}
        }

    def discover(self, max_iterations=50, target_confidence=0.95):
        """Main discovery loop"""

        while self.state["iteration"] < max_iterations:
            self.state["iteration"] += 1

            # 1. ASSESS: What's the current state? 
            assessment = self.assess_progress()

            # 2. PLAN: Which expert should work on what?
            tasks = self.plan_next_tasks(assessment)
            # Example: [
            #   {"expert": "structural", "task": "explore_orders_table", "priority": 0.8},
            #   {"expert": "semantic", "task": "interpret_customer_entity", "priority": 0.7},
            #   {"expert": "statistical", "task": "analyze_price_distribution", "priority": 0.6}
            # ]

            # 3. EXECUTE: Experts work in parallel
            results = self.execute_tasks_parallel(tasks)

            # 4. SYNTHESIZE: Combine findings
            synthesis = self.synthesize_findings(results)

            # 5. COLLABORATE: Experts share insights
            self.facilitate_collaboration(synthesis)

            # 6. REFLECT: Are we done?
            self.update_state(synthesis)

            done, reason = self.should_stop()  # should_stop() returns (bool, reason)
            if done:
                break

        # 7. FINALIZE: Create comprehensive understanding
        return self.create_final_report()

    def plan_next_tasks(self, assessment):
        """Decide what each expert should do next"""

        prompt = f"""
        You are orchestrating database discovery. Current state:
        {assessment}

        Expert findings:
        {self.format_expert_findings()}

        Plan the next exploration tasks. Consider:
        1. Which expert can contribute most valuable insights now?
        2. What areas need more exploration?
        3. Which expert findings should be verified or extended? 
+ + Output JSON array of tasks, each with: + - expert: which expert should do it + - task: what they should do + - priority: 0-1 (higher = more important) + - dependencies: [array of catalog keys this depends on] + """ + + return self.llm_call(prompt) + + def facilitate_collaboration(self, synthesis): + """Experts exchange notes and build on each other's work""" + + # Find points where experts should collaborate + collaborations = self.find_collaboration_opportunities(synthesis) + + for collab in collaborations: + # Example: Structural found relationship, Semantic should interpret it + prompt = f""" + EXPERT COLLABORATION: + + {collab['expert_a']} found: {collab['finding_a']} + + {collab['expert_b']}: Please interpret this finding from your perspective. + Consider: How does this affect your understanding? What follow-up is needed? + + Catalog context: {self.get_relevant_context(collab)} + """ + + response = self.llm_call(prompt, expert=collab['expert_b']) + self.catalog.save("collaboration", collab['id'], response) + + def create_final_report(self): + """Synthesize all discoveries into comprehensive understanding""" + + prompt = f""" + Create a comprehensive database understanding report from all expert findings. + + Include: + 1. Executive Summary + 2. Database Structure Overview + 3. Business Domain Analysis + 4. Key Insights & Patterns + 5. Data Quality Assessment + 6. 
Usage Recommendations + + Catalog data: + {self.catalog.export_all()} + """ + + return self.llm_call(prompt) +``` + +## Discovery Phases + +### Phase 1: Blind Exploration (Iterations 1-10) + +**Characteristics:** +- All experts work independently on basic discovery +- No domain assumptions +- Systematic data collection +- Build foundational knowledge + +**Expert Activities:** +- **Structural**: Map all tables, columns, relationships, constraints +- **Statistical**: Profile all columns, find distributions, cardinality +- **Semantic**: Identify entity types from naming patterns, infer basic domain +- **Query**: Analyze access patterns, identify indexes + +**Output:** +- Complete table inventory +- Column profiles for all fields +- Basic relationship mapping +- Initial domain hypothesis + +### Phase 2: Pattern Recognition (Iterations 11-30) + +**Characteristics:** +- Experts begin collaborating +- Patterns emerge from data +- Domain becomes clearer +- Hypotheses form + +**Expert Activities:** +- **Structural**: Identifies structural patterns (star schema, hierarchies) +- **Statistical**: Finds correlations, temporal patterns, outliers +- **Semantic**: Interprets relationships in business terms +- **Query**: Optimizes based on discovered patterns + +**Example Collaboration:** +``` +Structural → Catalog: "Found customers→orders relationship (customer_id)" +Semantic reads: "This indicates customers place orders (ecommerce)" +Statistical reads: "Analyzing order patterns by customer..." 
+Query: "Optimizing customer-centric queries using customer_id index" +``` + +**Output:** +- Domain identification (e.g., "This is an ecommerce database") +- Business entity definitions +- Relationship interpretations +- Pattern documentation + +### Phase 3: Hypothesis-Driven Exploration (Iterations 31-45) + +**Characteristics:** +- Experts form and test hypotheses +- Deep dives into specific areas +- Validation of assumptions +- Filling knowledge gaps + +**Example Hypotheses:** +- "This is a SaaS metrics database" → Test for subscription patterns +- "There are seasonal trends in orders" → Analyze temporal distributions +- "Data quality issues in customer emails" → Validate email formats +- "Unused indexes exist" → Check index usage statistics + +**Expert Activities:** +- All experts design experiments to test hypotheses +- Catalog stores hypothesis results (confirmed/refined/refuted) +- Collaboration to refine understanding based on evidence + +**Output:** +- Validated business insights +- Refined domain understanding +- Data quality assessment +- Performance optimization recommendations + +### Phase 4: Synthesis & Validation (Iterations 46-50) + +**Characteristics:** +- All experts collaborate to validate findings +- Resolve contradictions +- Fill remaining gaps +- Create unified understanding + +**Expert Activities:** +- Cross-expert validation of key findings +- Synthesis of comprehensive understanding +- Documentation of uncertainties +- Recommendations for further analysis + +**Output:** +- Final comprehensive report +- Confidence scores for each finding +- Remaining uncertainties +- Actionable recommendations + +## Domain-Agnostic Discovery Examples + +### Example 1: Law Firm Database + +**Phase 1-5 (Blind):** +``` +Structural: "Found: cases, clients, attorneys, documents, time_entries, billing_rates" +Statistical: "time_entries has 1.2M rows, highly skewed distribution, 15% null values" +Semantic: "Entity types: Cases (legal matters), Clients 
(people/companies), Attorneys" +Query: "Best access path: case_id → time_entries (indexed)" +``` + +**Phase 6-15 (Patterns):** +``` +Collaboration: + Structural → Semantic: "cases have many-to-many with attorneys (case_attorneys table)" + Semantic: "Multiple attorneys per case = legal teams" + Statistical: "time_entries correlate with case_stage progression (r=0.72)" + Query: "Filter by case_date_first for time range queries (30% faster)" + +Domain Inference: + Semantic: "Legal practice management system" + Structural: "Found invoices, payments tables - confirms practice management" + Statistical: "Billing patterns: hourly rates, contingency fees detected" +``` + +**Phase 16-30 (Hypotheses):** +``` +Hypothesis: "Firm specializes in specific case types" +→ Statistical: "Analyze case_type distribution" +→ Found: "70% personal_injury, 20% corporate_litigation, 10% family_law" + +Hypothesis: "Document workflow exists" +→ Structural: "Found document_versions, approvals, court_filings tables" +→ Semantic: "Document approval workflow for court submissions" + +Hypothesis: "Attorney productivity varies by case type" +→ Statistical: "Analyze time_entries per attorney per case_type" +→ Found: "Personal injury cases require 3.2x more attorney hours" +``` + +**Phase 31-40 (Synthesis):** +``` +Final Understanding: +"Mid-sized personal injury law firm (50-100 attorneys) +with practice management system including: +- Case management with document workflows +- Time tracking and billing (hourly + contingency) +- 70% focus on personal injury cases +- Average case duration: 18 months +- Key metrics: case duration, settlement amounts, + attorney productivity, document approval cycle time" +``` + +### Example 2: Scientific Research Database + +**Phase 1-5 (Blind):** +``` +Structural: "experiments, samples, measurements, researchers, publications, protocols" +Statistical: "High precision numeric data (10 decimal places), temporal patterns in experiments" +Semantic: "Research lab data 
management system" +Query: "Measurements table largest (45M rows), needs partitioning" +``` + +**Phase 6-15 (Patterns):** +``` +Domain: "Biology/medicine research (gene_sequences, drug_compounds detected)" +Patterns: "Experiments follow protocol → samples → measurements → analysis pipeline" +Structural: "Linear workflow: protocols → experiments → samples → measurements → analysis → publications" +Statistical: "High correlation between protocol_type and measurement_outcome" +``` + +**Phase 16-30 (Hypotheses):** +``` +Hypothesis: "Longitudinal study design" +→ Structural: "Found repeated_measurements, time_points tables" +→ Confirmed: "Same subjects measured over time" + +Hypothesis: "Control groups present" +→ Statistical: "Found clustering in measurements (treatment vs control)" +→ Confirmed: "Experimental design includes control groups" + +Hypothesis: "Statistical significance testing" +→ Statistical: "Found p_value distributions, confidence intervals in results" +→ Confirmed: "Clinical trial data with statistical validation" +``` + +**Phase 31-40 (Synthesis):** +``` +Final Understanding: +"Clinical trial data management system for pharmaceutical research +- Drug compound testing with control/treatment groups +- Longitudinal design (repeated measurements over time) +- Statistical validation pipeline +- Regulatory reporting (publication tracking) +- Sample tracking from collection to analysis" +``` + +### Example 3: E-commerce Database + +**Phase 1-5 (Blind):** +``` +Structural: "customers, orders, order_items, products, categories, inventory, reviews" +Statistical: "orders has 5.4M rows, steady growth trend, seasonal patterns" +Semantic: "Online retail platform" +Query: "orders table requires date-based partitioning" +``` + +**Phase 6-15 (Patterns):** +``` +Domain: "B2C ecommerce platform" +Relationships: "customers → orders (1:N), orders → order_items (1:N), order_items → products (N:1)" +Business flow: "Browse → Add to Cart → Checkout → Payment → Fulfillment" 
+Statistical: "Order value distribution: Long tail, $50 median, $280 mean" +``` + +**Phase 16-30 (Hypotheses):** +``` +Hypothesis: "Customer segments exist" +→ Statistical: "Cluster customers by order frequency, total spend, recency" +→ Found: "3 segments: Casual (70%), Regular (25%), VIP (5%)" + +Hypothesis: "Product categories affect return rates" +→ Statistical: "analyze returns by category" +→ Found: "Clothing: 12% return rate, Electronics: 3% return rate" + +Hypothesis: "Seasonal buying patterns" +→ Statistical: "Time series analysis of orders by month/day/week" +→ Found: "Peak: Nov-Dec (holidays), Dip: Jan, Slow: Feb-Mar" +``` + +**Phase 31-40 (Synthesis):** +``` +Final Understanding: +"Consumer ecommerce platform with: +- 5.4M orders, steady growth, strong seasonality +- 3 customer segments (Casual/Regular/VIP) with different behaviors +- 15% overall return rate (varies by category) +- Peak season: Nov-Dec (4.3x normal volume) +- Key metrics: conversion rate, AOV, customer lifetime value, return rate" +``` + +## Catalog Schema + +The catalog serves as shared memory for all experts. 
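
A minimal sketch of such a catalog over SQLite follows. This is illustrative only: the class name and schema are assumptions, and the real MCP-backed `CatalogClient` (including key-pattern lookups such as `get_kind("structure.*")`) is not reproduced here.

```python
import json
import sqlite3


class Catalog:
    """Minimal sketch of the shared catalog backed by SQLite.
    Pattern matching on kinds/keys is omitted for brevity."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS entries ("
            "kind TEXT, key TEXT, document TEXT, tags TEXT, "
            "PRIMARY KEY (kind, key))"
        )

    def save(self, kind, key, document, tags=""):
        # Upsert, so experts can refine earlier findings in place
        self.db.execute(
            "INSERT OR REPLACE INTO entries VALUES (?, ?, ?, ?)",
            (kind, key, json.dumps(document), tags),
        )

    def get_kind(self, kind):
        rows = self.db.execute(
            "SELECT key, document FROM entries WHERE kind = ?", (kind,)
        )
        return {key: json.loads(doc) for key, doc in rows}


catalog = Catalog()
catalog.save("structure", "table.customers",
             {"row_count": 125000}, tags="customers,table")
print(catalog.get_kind("structure"))
# → {'table.customers': {'row_count': 125000}}
```

The `(kind, key)` primary key makes writes idempotent, which matters when multiple experts revisit the same table across iterations.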
Key entry types: + +### Structure Entries +```json +{ + "kind": "structure", + "key": "table.customers", + "document": { + "columns": ["customer_id", "name", "email", "created_at"], + "primary_key": "customer_id", + "foreign_keys": [{"column": "region_id", "references": "regions(id)"}], + "row_count": 125000 + }, + "tags": "customers,table" +} +``` + +### Statistics Entries +```json +{ + "kind": "statistics", + "key": "customers.created_at", + "document": { + "distinct_count": 118500, + "null_percentage": 0.0, + "min": "2020-01-15", + "max": "2025-01-10", + "distribution": "uniform_growth" + }, + "tags": "customers,created_at,temporal" +} +``` + +### Semantic Entries +```json +{ + "kind": "semantic", + "key": "entity.customers", + "document": { + "entity_type": "Customer", + "definition": "Individual shoppers who place orders", + "business_role": "Revenue generator", + "lifecycle": "Registered → Active → Inactive → Churned" + }, + "tags": "semantic,entity,customers" +} +``` + +### Relationship Entries +```json +{ + "kind": "relationship", + "key": "customers↔orders", + "document": { + "type": "one_to_many", + "join_key": "customer_id", + "business_meaning": "Customers place multiple orders", + "cardinality_estimates": { + "min_orders_per_customer": 1, + "max_orders_per_customer": 247, + "avg_orders_per_customer": 4.3 + } + }, + "tags": "relationship,customers,orders" +} +``` + +### Hypothesis Entries +```json +{ + "kind": "hypothesis", + "key": "vip_segment_behavior", + "document": { + "hypothesis": "VIP customers have higher order frequency and AOV", + "status": "confirmed", + "confidence": 0.92, + "evidence": [ + "VIP avg 12.4 orders/year vs 2.1 for regular", + "VIP avg AOV $156 vs $45 for regular" + ] + }, + "tags": "hypothesis,customer_segments,confirmed" +} +``` + +### Collaboration Entries +```json +{ + "kind": "collaboration", + "key": "semantic_interpretation_001", + "document": { + "trigger": "Structural expert found orders.status enum", + "expert": 
"semantic", + "interpretation": "Order lifecycle: pending → confirmed → shipped → delivered", + "follow_up_tasks": ["Analyze time_in_status durations", "Find bottleneck status"] + }, + "tags": "collaboration,structural,semantic,order_lifecycle" +} +``` + +## Stopping Criteria + +The orchestrator evaluates whether to continue exploration based on: + +1. **Confidence Threshold** - Overall confidence in understanding exceeds target (e.g., 0.95) +2. **Coverage Threshold** - Sufficient percentage of database explored (e.g., 95% of tables analyzed) +3. **Diminishing Returns** - Last N iterations produced minimal new insights +4. **Resource Limits** - Maximum iterations reached or time budget exceeded +5. **Expert Consensus** - All experts indicate satisfactory understanding + +```python +def should_stop(self): + # High confidence in core understanding + if self.state["confidence"] >= 0.95: + return True, "Confidence threshold reached" + + # Good coverage of database + if self.state["coverage"] >= 0.95: + return True, "Coverage threshold reached" + + # Diminishing returns + if self.state["recent_insights"] < 2: + self.state["diminishing_returns"] += 1 + if self.state["diminishing_returns"] >= 3: + return True, "Diminishing returns" + + # Expert consensus + if all(expert.satisfied() for expert in self.experts): + return True, "Expert consensus achieved" + + return False, "Continue exploration" +``` + +## Implementation Considerations + +### Scalability + +For large databases (hundreds/thousands of tables): +- **Parallel Exploration**: Experts work simultaneously on different table subsets +- **Incremental Coverage**: Prioritize important tables (many relationships, high cardinality) +- **Smart Sampling**: Use statistical sampling instead of full scans for large tables +- **Progressive Refinement**: Start with overview, drill down iteratively + +### Performance + +- **Caching**: Cache catalog queries to avoid repeated reads +- **Batch Operations**: Group multiple tool calls 
when possible +- **Index-Aware**: Let Query Expert guide exploration to use indexed columns +- **Connection Pooling**: Reuse database connections (already implemented in MCP) + +### Error Handling + +- **Graceful Degradation**: If one expert fails, others continue +- **Retry Logic**: Transient errors trigger retries with backoff +- **Partial Results**: Catalog stores partial findings if interrupted +- **Validation**: Experts cross-validate each other's findings + +### Extensibility + +- **Pluggable Experts**: New expert types can be added easily +- **Domain-Specific Experts**: Specialized experts for healthcare, finance, etc. +- **Custom Tools**: Additional MCP tools for specific analysis needs +- **Expert Configuration**: Experts can be configured/enabled based on needs + +## Usage Example + +```python +from discovery_agent import DiscoveryOrchestrator + +# Initialize agent +agent = DiscoveryOrchestrator( + mcp_endpoint="https://localhost:6071/mcp/query", + auth_token="your_token" +) + +# Run discovery +report = agent.discover( + max_iterations=50, + target_confidence=0.95 +) + +# Access findings +print(report["summary"]) +print(report["domain"]) +print(report["key_insights"]) + +# Query catalog for specific information +customers_analysis = agent.catalog.search("customers") +relationships = agent.catalog.get_kind("relationship") +``` + +## Related Documentation + +- [Architecture.md](Architecture.md) - Overall MCP architecture +- [README.md](README.md) - Module overview and setup +- [VARIABLES.md](VARIABLES.md) - Configuration variables reference + +## Version History + +- **1.0** (2025-01-12) - Initial architecture design