mirror of https://github.com/sysown/proxysql
# Database Discovery Agent Architecture (Conceptual Design)

## Overview

This document describes a conceptual architecture for an AI-powered database discovery agent that could autonomously explore, understand, and analyze any database schema, regardless of complexity or domain. The agent would use a mixture-of-experts approach in which specialized LLM agents collaborate to build a comprehensive understanding of database structures, data patterns, and business semantics.

**Note:** This is a conceptual design document. The actual ProxySQL MCP implementation uses a different approach, based on the two-phase discovery architecture described in `Two_Phase_Discovery_Implementation.md`.

## Core Principles

1. **Domain Agnostic** - No assumptions about what the database contains; everything is discovered
2. **Iterative Exploration** - Not a one-time schema dump; continuous learning over multiple cycles
3. **Collaborative Intelligence** - Multiple experts with different perspectives work together
4. **Hypothesis-Driven** - Experts form hypotheses, test them, and refine their understanding
5. **Confidence-Based** - Exploration continues until a confidence threshold is reached

## High-Level Architecture

```
┌───────────────────────────────────────────────────────────────────────────────────┐
│                                ORCHESTRATOR AGENT                                 │
│  - Manages exploration state                                                      │
│  - Coordinates expert agents                                                      │
│  - Synthesizes findings                                                           │
│  - Decides when exploration is complete                                           │
└───────────────────────────────────────────────────────────────────────────────────┘
                                          │
             ┌────────────────────────────┼────────────────────────────┐
             ▼                            ▼                            ▼
┌─────────────────────────┐  ┌─────────────────────────┐  ┌─────────────────────────┐
│    STRUCTURAL EXPERT    │  │   STATISTICAL EXPERT    │  │     SEMANTIC EXPERT     │
│                         │  │                         │  │                         │
│ - Schemas & tables      │  │ - Data distributions    │  │ - Business meaning      │
│ - Relationships         │  │ - Patterns & trends     │  │ - Domain concepts       │
│ - Constraints           │  │ - Outliers & anomalies  │  │ - Entity types          │
│ - Indexes & keys        │  │ - Correlations          │  │ - User intent           │
└─────────────────────────┘  └─────────────────────────┘  └─────────────────────────┘
             │                            │                            │
             └────────────────────────────┼────────────────────────────┘
                                          ▼
                         ┌─────────────────────────────────┐
                         │          SHARED CATALOG         │
                         │          (SQLite + MCP)         │
                         │                                 │
                         │  Expert discoveries             │
                         │  Cross-expert notes             │
                         │  Exploration state              │
                         │  Hypotheses & results           │
                         └─────────────────────────────────┘
                                          │
                                          ▼
                         ┌─────────────────────────────────┐
                         │        MCP Query Endpoint       │
                         │  - Database access              │
                         │  - Catalog operations           │
                         │  - All tools available          │
                         └─────────────────────────────────┘
```

## Expert Specializations

### 1. Structural Expert

**Focus:** Database topology and relationships

**Responsibilities:**
- Map all schemas, tables, and their relationships
- Identify primary keys, foreign keys, and constraints
- Analyze index patterns and access structures
- Detect table hierarchies and dependencies
- Identify structural patterns (star schema, snowflake, hierarchical, etc.)

**Exploration Strategy:**
```python
from itertools import combinations

class StructuralExpert:
    def explore(self, catalog):
        # Iteration 1: Map the territory
        tables = self.list_all_tables()
        for table in tables:
            schema = self.get_table_schema(table)
            relationships = self.find_relationships(table)

            catalog.save("structure", f"table.{table}", {
                "columns": schema["columns"],
                "primary_key": schema["pk"],
                "foreign_keys": relationships,
                "indexes": schema["indexes"]
            })

        # Iteration 2: Find connection points between every pair of tables
        for table_a, table_b in combinations(tables, 2):
            joins = self.suggest_joins(table_a, table_b)
            if joins:
                catalog.save("relationship", f"{table_a}↔{table_b}", joins)

        # Iteration 3: Identify structural patterns
        # e.g. "This looks like a star schema", "Hierarchical structure"
        patterns = self.identify_patterns(catalog)
        catalog.save("structure", "patterns", patterns)
```

**Output Examples:**
- "Found 47 tables across 3 schemas"
- "customers table has 1:many relationship with orders via customer_id"
- "Detected star schema: fact_orders with dims: customers, products, time"
- "Table hierarchy: categories → subcategories → products"

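
The `find_relationships` and `suggest_joins` steps above are left abstract. When foreign keys are not declared, one common heuristic is to match `<entity>_id` columns against table names. The sketch below is illustrative only (function name and table-metadata shape are assumptions, not the actual implementation):

```python
def suggest_joins(tables):
    """Guess FK relationships from `<entity>_id` naming conventions.

    `tables` maps table name -> list of column names. Returns a list of
    (child_table, column, parent_table) candidate relationships.
    """
    candidates = []
    for child, columns in tables.items():
        for col in columns:
            if not col.endswith("_id"):
                continue
            stem = col[:-3]                      # "customer_id" -> "customer"
            for parent in tables:
                if parent == child:
                    continue
                # match "customer" against both "customer" and "customers"
                if parent in (stem, stem + "s") and "id" in tables[parent]:
                    candidates.append((child, col, parent))
    return candidates


tables = {
    "customers":   ["id", "name", "email"],
    "orders":      ["id", "customer_id", "total"],
    "order_items": ["id", "order_id", "product_id", "quantity"],
    "products":    ["id", "sku", "price"],
}
joins = suggest_joins(tables)
```

A real expert would verify each candidate by sampling overlapping values before saving it to the catalog, since naming conventions alone produce false positives.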
### 2. Statistical Expert

**Focus:** Data characteristics and patterns

**Responsibilities:**
- Profile data distributions for all columns
- Identify correlations between fields
- Detect outliers and anomalies
- Find temporal patterns and trends
- Calculate data quality metrics

**Exploration Strategy:**
```python
class StatisticalExpert:
    def explore(self, catalog):
        # Read structural discoveries first
        tables = catalog.get_kind("table.*")

        for table in tables:
            # Profile each column
            for col in table["columns"]:
                stats = self.get_column_stats(table, col)

                catalog.save("statistics", f"{table}.{col}", {
                    "distinct_count": stats["distinct"],
                    "null_percentage": stats["null_pct"],
                    "distribution": stats["histogram"],
                    "top_values": stats["top_20"],
                    "numeric_range": stats.get("min_max"),  # None for non-numeric columns
                    "anomalies": stats["outliers"]
                })

        # Find correlations
        correlations = self.find_correlations(tables)
        catalog.save("patterns", "correlations", correlations)
```

**Output Examples:**
- "orders.status has 4 values: pending (23%), confirmed (45%), shipped (28%), cancelled (4%)"
- "Strong correlation (0.87) between order_items.quantity and order_total"
- "Outlier detected: customer_age has values > 150 (likely data error)"
- "Temporal pattern: 80% of orders placed M-F, 9am-5pm"

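
The `get_column_stats` call above can be grounded with a small, self-contained profiler. This sketch (names and the IQR outlier rule are illustrative choices, not the actual implementation) computes the cardinality, null rate, top values, and anomalies the expert records:

```python
import statistics

def profile_column(values, top_n=3):
    """Profile one column: cardinality, null rate, top values, IQR outliers."""
    non_null = [v for v in values if v is not None]
    profile = {
        "distinct_count": len(set(non_null)),
        "null_percentage": round(100.0 * (len(values) - len(non_null)) / len(values), 1),
        # (count, value) pairs, most frequent first
        "top_values": sorted(((non_null.count(v), v) for v in set(non_null)),
                             reverse=True)[:top_n],
    }
    if non_null and all(isinstance(v, (int, float)) for v in non_null):
        q1, _, q3 = statistics.quantiles(non_null, n=4)
        iqr = q3 - q1
        profile["numeric_range"] = (min(non_null), max(non_null))
        # Tukey's rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
        profile["anomalies"] = [v for v in non_null
                                if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]
    return profile

ages = [34, 29, 41, 37, 33, None, 31, 36, 180]   # 180 is a likely data error
p = profile_column(ages)
```

In production this would run as SQL aggregates against the database rather than in Python, but the recorded shape of the profile is the same.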
### 3. Semantic Expert

**Focus:** Business meaning and domain understanding

**Responsibilities:**
- Infer the business domain from data patterns
- Identify entity types and their roles
- Interpret relationships in business terms
- Understand user intent and use cases
- Document business rules and constraints

**Exploration Strategy:**
```python
class SemanticExpert:
    def explore(self, catalog):
        # Synthesize findings from other experts
        structure = catalog.get_kind("structure.*")
        stats = catalog.get_kind("statistics.*")

        for table in structure:
            # Infer domain from table name, columns, and data
            # e.g. "This is an ecommerce database"
            domain = self.infer_domain(table, stats)

            # Understand entities
            # e.g. "customers table = Customer entities"
            entity_type = self.identify_entity(table)

            catalog.save("semantic", f"entity.{table}", {
                "domain": domain,
                "entity_type": entity_type
            })

            # Understand relationships
            for rel in catalog.get_relationships(table):
                other = rel["related_table"]
                business_rel = self.interpret_relationship(rel)
                # e.g. "customer has many orders"
                catalog.save("semantic", f"rel.{table}.{other}", {
                    "relationship": business_rel,
                    "cardinality": "one-to-many",
                    "business_rule": "A customer can place multiple orders"
                })

        # Identify business processes
        # e.g. "Order fulfillment flow: orders → order_items → products"
        processes = self.infer_processes(structure, stats)
        catalog.save("semantic", "processes", processes)
```

**Output Examples:**
- "Domain inference: E-commerce platform (B2C)"
- "Entity: customers represents individual shoppers, not businesses"
- "Business process: Order lifecycle = pending → confirmed → shipped → delivered"
- "Business rule: Customer cannot be deleted if they have active orders"

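
`infer_domain` would be an LLM call in practice, but the shape of the inference can be sketched as vocabulary overlap between discovered table names and per-domain signal sets. Everything here (the keyword map, the scoring rule) is a hypothetical stand-in for the expert's judgment:

```python
# Hypothetical signal vocabulary; a real Semantic Expert would reason with
# an LLM, but it is still scoring domains by evidence overlap.
DOMAIN_SIGNALS = {
    "ecommerce": {"customers", "orders", "order_items", "products", "inventory"},
    "legal":     {"cases", "clients", "attorneys", "court_filings", "billing_rates"},
    "research":  {"experiments", "samples", "measurements", "protocols"},
}

def infer_domain(table_names):
    """Return (best_domain, confidence) from table-name overlap with signals."""
    scores = {
        domain: len(signals & set(table_names)) / len(signals)
        for domain, signals in DOMAIN_SIGNALS.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]

domain, confidence = infer_domain(["customers", "orders", "products", "reviews"])
```

The returned confidence feeds the hypothesis entries described later: a weak score would be recorded as a hypothesis to test, not a conclusion.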
### 4. Query Expert

**Focus:** Efficient data access patterns

**Responsibilities:**
- Analyze query optimization opportunities
- Recommend index usage strategies
- Determine optimal join orders
- Design sampling strategies for exploration
- Identify performance bottlenecks

**Exploration Strategy:**
```python
class QueryExpert:
    def explore(self, catalog):
        # Analyze query patterns from structural expert
        structure = catalog.get_kind("structure.*")

        for table in structure:
            # Suggest optimal access patterns
            access_patterns = self.analyze_access_patterns(table)
            catalog.save("query", f"access.{table}", {
                "best_index": access_patterns["optimal_index"],
                "join_order": access_patterns["optimal_join_order"],
                "sampling_strategy": access_patterns["sample_method"]
            })
```

**Output Examples:**
- "For customers table, use idx_email for lookups, idx_created_at for time ranges"
- "Join order: customers → orders → order_items (not reverse)"
- "Sample strategy: Use TABLESAMPLE for large tables, LIMIT 1000 for small"

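
Two of the output examples above can be reduced to simple heuristics: drive joins from the smallest table outward, and choose sampling by row count. Both functions below are sketches under assumed inputs (a row-count map); the one-million-row threshold is illustrative:

```python
def plan_join_order(row_counts):
    """Greedy heuristic: drive joins from the smallest table outward."""
    return sorted(row_counts, key=row_counts.get)

def pick_sampling(row_counts, threshold=1_000_000):
    """TABLESAMPLE for large tables, a plain LIMIT for small ones."""
    return {t: "TABLESAMPLE SYSTEM (1)" if n >= threshold else "LIMIT 1000"
            for t, n in row_counts.items()}

counts = {"order_items": 12_400_000, "orders": 5_400_000, "customers": 125_000}
order = plan_join_order(counts)
sampling = pick_sampling(counts)
```

A real Query Expert would refine both with index statistics and `EXPLAIN` output rather than row counts alone.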
## Orchestrator: The Conductor

The Orchestrator agent coordinates all experts and manages the overall discovery process.

```python
class DiscoveryOrchestrator:
    """Coordinates the collaborative discovery process"""

    def __init__(self, mcp_endpoint, auth_token=None):
        self.mcp = MCPClient(mcp_endpoint, auth_token=auth_token)
        self.catalog = CatalogClient(self.mcp)

        self.experts = [
            StructuralExpert(self.catalog),
            StatisticalExpert(self.catalog),
            SemanticExpert(self.catalog),
            QueryExpert(self.catalog)
        ]

        self.state = {
            "iteration": 0,
            "phase": "initial",
            "confidence": 0.0,
            "coverage": 0.0,            # % of database explored
            "recent_insights": 0,       # new findings in the last iteration
            "diminishing_returns": 0,   # consecutive low-insight iterations
            "expert_contributions": {e.name: 0 for e in self.experts}
        }

    def discover(self, max_iterations=50, target_confidence=0.95):
        """Main discovery loop"""

        while self.state["iteration"] < max_iterations:
            self.state["iteration"] += 1

            # 1. ASSESS: What's the current state?
            assessment = self.assess_progress()

            # 2. PLAN: Which expert should work on what?
            tasks = self.plan_next_tasks(assessment)
            # Example: [
            #   {"expert": "structural", "task": "explore_orders_table", "priority": 0.8},
            #   {"expert": "semantic", "task": "interpret_customer_entity", "priority": 0.7},
            #   {"expert": "statistical", "task": "analyze_price_distribution", "priority": 0.6}
            # ]

            # 3. EXECUTE: Experts work in parallel
            results = self.execute_tasks_parallel(tasks)

            # 4. SYNTHESIZE: Combine findings
            synthesis = self.synthesize_findings(results)

            # 5. COLLABORATE: Experts share insights
            self.facilitate_collaboration(synthesis)

            # 6. REFLECT: Are we done?
            self.update_state(synthesis)

            done, reason = self.should_stop()
            if done:
                break

        # 7. FINALIZE: Create comprehensive understanding
        return self.create_final_report()

    def plan_next_tasks(self, assessment):
        """Decide what each expert should do next"""

        prompt = f"""
        You are orchestrating database discovery. Current state:
        {assessment}

        Expert findings:
        {self.format_expert_findings()}

        Plan the next exploration tasks. Consider:
        1. Which expert can contribute the most valuable insights now?
        2. What areas need more exploration?
        3. Which expert findings should be verified or extended?

        Output a JSON array of tasks, each with:
        - expert: which expert should do it
        - task: what they should do
        - priority: 0-1 (higher = more important)
        - dependencies: [array of catalog keys this depends on]
        """

        return self.llm_call(prompt)

    def facilitate_collaboration(self, synthesis):
        """Experts exchange notes and build on each other's work"""

        # Find points where experts should collaborate
        collaborations = self.find_collaboration_opportunities(synthesis)

        for collab in collaborations:
            # Example: Structural found a relationship; Semantic should interpret it
            prompt = f"""
            EXPERT COLLABORATION:

            {collab['expert_a']} found: {collab['finding_a']}

            {collab['expert_b']}: Please interpret this finding from your perspective.
            Consider: How does this affect your understanding? What follow-up is needed?

            Catalog context: {self.get_relevant_context(collab)}
            """

            response = self.llm_call(prompt, expert=collab['expert_b'])
            self.catalog.save("collaboration", collab['id'], response)

    def create_final_report(self):
        """Synthesize all discoveries into a comprehensive understanding"""

        prompt = f"""
        Create a comprehensive database understanding report from all expert findings.

        Include:
        1. Executive Summary
        2. Database Structure Overview
        3. Business Domain Analysis
        4. Key Insights & Patterns
        5. Data Quality Assessment
        6. Usage Recommendations

        Catalog data:
        {self.catalog.export_all()}
        """

        return self.llm_call(prompt)
```

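
The control flow of `discover` can be exercised end to end with stub experts. This toy (all names are illustrative, not the orchestrator API) shows how the loop accumulates findings until a stopping criterion fires — here, diminishing returns rather than the confidence threshold:

```python
class StubExpert:
    """Produces a fixed budget of findings, then goes quiet."""
    def __init__(self, name, insight_budget):
        self.name, self.budget = name, insight_budget

    def run(self):
        if self.budget == 0:
            return []
        self.budget -= 1
        return [f"{self.name}-insight-{self.budget}"]

def discover(experts, target_confidence=0.95, max_iterations=50):
    findings, confidence, iteration = [], 0.0, 0
    while confidence < target_confidence and iteration < max_iterations:
        iteration += 1
        new = [f for e in experts for f in e.run()]
        findings.extend(new)
        # crude confidence model: each productive iteration adds 20%
        confidence = min(1.0, confidence + (0.2 if new else 0.0))
        if not new:                      # diminishing returns -> stop early
            break
    return {"iterations": iteration, "confidence": confidence, "findings": findings}

report = discover([StubExpert("structural", 3), StubExpert("semantic", 2)])
```

The first empty iteration terminates the loop, which is exactly the "diminishing returns" criterion formalized in the Stopping Criteria section below.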
## Discovery Phases

### Phase 1: Blind Exploration (Iterations 1-10)

**Characteristics:**
- All experts work independently on basic discovery
- No domain assumptions
- Systematic data collection
- Build foundational knowledge

**Expert Activities:**
- **Structural**: Map all tables, columns, relationships, constraints
- **Statistical**: Profile all columns, find distributions, cardinality
- **Semantic**: Identify entity types from naming patterns, infer basic domain
- **Query**: Analyze access patterns, identify indexes

**Output:**
- Complete table inventory
- Column profiles for all fields
- Basic relationship mapping
- Initial domain hypothesis

### Phase 2: Pattern Recognition (Iterations 11-30)

**Characteristics:**
- Experts begin collaborating
- Patterns emerge from data
- Domain becomes clearer
- Hypotheses form

**Expert Activities:**
- **Structural**: Identifies structural patterns (star schema, hierarchies)
- **Statistical**: Finds correlations, temporal patterns, outliers
- **Semantic**: Interprets relationships in business terms
- **Query**: Optimizes based on discovered patterns

**Example Collaboration:**
```
Structural → Catalog: "Found customers→orders relationship (customer_id)"
Semantic reads: "This indicates customers place orders (ecommerce)"
Statistical reads: "Analyzing order patterns by customer..."
Query: "Optimizing customer-centric queries using customer_id index"
```

**Output:**
- Domain identification (e.g., "This is an ecommerce database")
- Business entity definitions
- Relationship interpretations
- Pattern documentation

### Phase 3: Hypothesis-Driven Exploration (Iterations 31-45)

**Characteristics:**
- Experts form and test hypotheses
- Deep dives into specific areas
- Validation of assumptions
- Filling knowledge gaps

**Example Hypotheses:**
- "This is a SaaS metrics database" → Test for subscription patterns
- "There are seasonal trends in orders" → Analyze temporal distributions
- "Data quality issues in customer emails" → Validate email formats
- "Unused indexes exist" → Check index usage statistics

**Expert Activities:**
- All experts design experiments to test hypotheses
- Catalog stores hypothesis results (confirmed/refined/refuted)
- Collaboration to refine understanding based on evidence

**Output:**
- Validated business insights
- Refined domain understanding
- Data quality assessment
- Performance optimization recommendations

### Phase 4: Synthesis & Validation (Iterations 46-50)

**Characteristics:**
- All experts collaborate to validate findings
- Resolve contradictions
- Fill remaining gaps
- Create unified understanding

**Expert Activities:**
- Cross-expert validation of key findings
- Synthesis of comprehensive understanding
- Documentation of uncertainties
- Recommendations for further analysis

**Output:**
- Final comprehensive report
- Confidence scores for each finding
- Remaining uncertainties
- Actionable recommendations

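
The four iteration ranges above map directly to a lookup the orchestrator could use to set `state["phase"]`. A minimal sketch (function name assumed):

```python
# Upper iteration bound for each phase, matching the ranges defined above.
PHASES = [
    (10, "Blind Exploration"),
    (30, "Pattern Recognition"),
    (45, "Hypothesis-Driven Exploration"),
    (50, "Synthesis & Validation"),
]

def phase_for(iteration):
    """Return the phase name for a 1-based iteration number."""
    for upper, name in PHASES:
        if iteration <= upper:
            return name
    return "Synthesis & Validation"  # clamp anything past the iteration budget
```

These boundaries are guidelines, not hard gates: an expert that forms a testable hypothesis in iteration 5 need not wait for Phase 3.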
## Domain-Agnostic Discovery Examples

### Example 1: Law Firm Database

**Phase 1 (Blind Exploration):**
```
Structural: "Found: cases, clients, attorneys, documents, time_entries, billing_rates"
Statistical: "time_entries has 1.2M rows, highly skewed distribution, 15% null values"
Semantic: "Entity types: Cases (legal matters), Clients (people/companies), Attorneys"
Query: "Best access path: case_id → time_entries (indexed)"
```

**Phase 2 (Pattern Recognition):**
```
Collaboration:
  Structural → Semantic: "cases have many-to-many with attorneys (case_attorneys table)"
  Semantic: "Multiple attorneys per case = legal teams"
  Statistical: "time_entries correlate with case_stage progression (r=0.72)"
  Query: "Filter by case_date_first for time range queries (30% faster)"

Domain Inference:
  Semantic: "Legal practice management system"
  Structural: "Found invoices, payments tables - confirms practice management"
  Statistical: "Billing patterns: hourly rates, contingency fees detected"
```

**Phase 3 (Hypothesis-Driven Exploration):**
```
Hypothesis: "Firm specializes in specific case types"
→ Statistical: "Analyze case_type distribution"
→ Found: "70% personal_injury, 20% corporate_litigation, 10% family_law"

Hypothesis: "Document workflow exists"
→ Structural: "Found document_versions, approvals, court_filings tables"
→ Semantic: "Document approval workflow for court submissions"

Hypothesis: "Attorney productivity varies by case type"
→ Statistical: "Analyze time_entries per attorney per case_type"
→ Found: "Personal injury cases require 3.2x more attorney hours"
```

**Phase 4 (Synthesis & Validation):**
```
Final Understanding:
"Mid-sized personal injury law firm (50-100 attorneys)
with practice management system including:
- Case management with document workflows
- Time tracking and billing (hourly + contingency)
- 70% focus on personal injury cases
- Average case duration: 18 months
- Key metrics: case duration, settlement amounts,
  attorney productivity, document approval cycle time"
```

### Example 2: Scientific Research Database

**Phase 1 (Blind Exploration):**
```
Structural: "experiments, samples, measurements, researchers, publications, protocols"
Statistical: "High precision numeric data (10 decimal places), temporal patterns in experiments"
Semantic: "Research lab data management system"
Query: "Measurements table largest (45M rows), needs partitioning"
```

**Phase 2 (Pattern Recognition):**
```
Domain: "Biology/medicine research (gene_sequences, drug_compounds detected)"
Patterns: "Experiments follow protocol → samples → measurements → analysis pipeline"
Structural: "Linear workflow: protocols → experiments → samples → measurements → analysis → publications"
Statistical: "High correlation between protocol_type and measurement_outcome"
```

**Phase 3 (Hypothesis-Driven Exploration):**
```
Hypothesis: "Longitudinal study design"
→ Structural: "Found repeated_measurements, time_points tables"
→ Confirmed: "Same subjects measured over time"

Hypothesis: "Control groups present"
→ Statistical: "Found clustering in measurements (treatment vs control)"
→ Confirmed: "Experimental design includes control groups"

Hypothesis: "Statistical significance testing"
→ Statistical: "Found p_value distributions, confidence intervals in results"
→ Confirmed: "Clinical trial data with statistical validation"
```

**Phase 4 (Synthesis & Validation):**
```
Final Understanding:
"Clinical trial data management system for pharmaceutical research
- Drug compound testing with control/treatment groups
- Longitudinal design (repeated measurements over time)
- Statistical validation pipeline
- Regulatory reporting (publication tracking)
- Sample tracking from collection to analysis"
```

### Example 3: E-commerce Database

**Phase 1 (Blind Exploration):**
```
Structural: "customers, orders, order_items, products, categories, inventory, reviews"
Statistical: "orders has 5.4M rows, steady growth trend, seasonal patterns"
Semantic: "Online retail platform"
Query: "orders table requires date-based partitioning"
```

**Phase 2 (Pattern Recognition):**
```
Domain: "B2C ecommerce platform"
Relationships: "customers → orders (1:N), orders → order_items (1:N), order_items → products (N:1)"
Business flow: "Browse → Add to Cart → Checkout → Payment → Fulfillment"
Statistical: "Order value distribution: Long tail, $50 median, $280 mean"
```

**Phase 3 (Hypothesis-Driven Exploration):**
```
Hypothesis: "Customer segments exist"
→ Statistical: "Cluster customers by order frequency, total spend, recency"
→ Found: "3 segments: Casual (70%), Regular (25%), VIP (5%)"

Hypothesis: "Product categories affect return rates"
→ Statistical: "Analyze returns by category"
→ Found: "Clothing: 12% return rate, Electronics: 3% return rate"

Hypothesis: "Seasonal buying patterns"
→ Statistical: "Time series analysis of orders by month/day/week"
→ Found: "Peak: Nov-Dec (holidays), Dip: Jan, Slow: Feb-Mar"
```

**Phase 4 (Synthesis & Validation):**
```
Final Understanding:
"Consumer ecommerce platform with:
- 5.4M orders, steady growth, strong seasonality
- 3 customer segments (Casual/Regular/VIP) with different behaviors
- 15% overall return rate (varies by category)
- Peak season: Nov-Dec (4.3x normal volume)
- Key metrics: conversion rate, AOV, customer lifetime value, return rate"
```

## Catalog Schema

The catalog serves as shared memory for all experts. Key entry types:

### Structure Entries
```json
{
  "kind": "structure",
  "key": "table.customers",
  "document": {
    "columns": ["customer_id", "name", "email", "created_at"],
    "primary_key": "customer_id",
    "foreign_keys": [{"column": "region_id", "references": "regions(id)"}],
    "row_count": 125000
  },
  "tags": "customers,table"
}
```

### Statistics Entries
```json
{
  "kind": "statistics",
  "key": "customers.created_at",
  "document": {
    "distinct_count": 118500,
    "null_percentage": 0.0,
    "min": "2020-01-15",
    "max": "2025-01-10",
    "distribution": "uniform_growth"
  },
  "tags": "customers,created_at,temporal"
}
```

### Semantic Entries
```json
{
  "kind": "semantic",
  "key": "entity.customers",
  "document": {
    "entity_type": "Customer",
    "definition": "Individual shoppers who place orders",
    "business_role": "Revenue generator",
    "lifecycle": "Registered → Active → Inactive → Churned"
  },
  "tags": "semantic,entity,customers"
}
```

### Relationship Entries
```json
{
  "kind": "relationship",
  "key": "customers↔orders",
  "document": {
    "type": "one_to_many",
    "join_key": "customer_id",
    "business_meaning": "Customers place multiple orders",
    "cardinality_estimates": {
      "min_orders_per_customer": 1,
      "max_orders_per_customer": 247,
      "avg_orders_per_customer": 4.3
    }
  },
  "tags": "relationship,customers,orders"
}
```

### Hypothesis Entries
```json
{
  "kind": "hypothesis",
  "key": "vip_segment_behavior",
  "document": {
    "hypothesis": "VIP customers have higher order frequency and AOV",
    "status": "confirmed",
    "confidence": 0.92,
    "evidence": [
      "VIP avg 12.4 orders/year vs 2.1 for regular",
      "VIP avg AOV $156 vs $45 for regular"
    ]
  },
  "tags": "hypothesis,customer_segments,confirmed"
}
```

### Collaboration Entries
```json
{
  "kind": "collaboration",
  "key": "semantic_interpretation_001",
  "document": {
    "trigger": "Structural expert found orders.status enum",
    "expert": "semantic",
    "interpretation": "Order lifecycle: pending → confirmed → shipped → delivered",
    "follow_up_tasks": ["Analyze time_in_status durations", "Find bottleneck status"]
  },
  "tags": "collaboration,structural,semantic,order_lifecycle"
}
```

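
Since the entries above all share a `kind`/`key`/`document`/`tags` shape, the catalog can be backed by a single SQLite table. A minimal sketch — table layout and method names are illustrative; the real catalog lives behind MCP tools:

```python
import json
import sqlite3

class Catalog:
    """Tiny kind/key/document/tags store matching the entry shape above."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute("""CREATE TABLE IF NOT EXISTS catalog (
            kind TEXT, key TEXT, document TEXT, tags TEXT,
            PRIMARY KEY (kind, key))""")

    def save(self, kind, key, document, tags=""):
        # Upsert so experts can refine earlier findings in later iterations
        self.db.execute("INSERT OR REPLACE INTO catalog VALUES (?, ?, ?, ?)",
                        (kind, key, json.dumps(document), tags))

    def get_kind(self, kind):
        rows = self.db.execute(
            "SELECT key, document FROM catalog WHERE kind = ?", (kind,))
        return {k: json.loads(doc) for k, doc in rows}

    def search(self, term):
        rows = self.db.execute(
            "SELECT kind, key FROM catalog WHERE key LIKE ? OR tags LIKE ?",
            (f"%{term}%", f"%{term}%"))
        return list(rows)

cat = Catalog()
cat.save("structure", "table.customers",
         {"primary_key": "customer_id", "row_count": 125000},
         tags="customers,table")
cat.save("relationship", "customers↔orders",
         {"type": "one_to_many", "join_key": "customer_id"},
         tags="relationship,customers,orders")
```

The `(kind, key)` primary key makes every `save` idempotent, which is what lets interrupted runs resume from partial findings.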
## Stopping Criteria

The orchestrator evaluates whether to continue exploration based on:

1. **Confidence Threshold** - Overall confidence in understanding exceeds the target (e.g., 0.95)
2. **Coverage Threshold** - A sufficient percentage of the database has been explored (e.g., 95% of tables analyzed)
3. **Diminishing Returns** - The last N iterations produced minimal new insights
4. **Resource Limits** - Maximum iterations reached or time budget exceeded
5. **Expert Consensus** - All experts indicate satisfactory understanding

```python
def should_stop(self):
    # High confidence in core understanding
    if self.state["confidence"] >= 0.95:
        return True, "Confidence threshold reached"

    # Good coverage of database
    if self.state["coverage"] >= 0.95:
        return True, "Coverage threshold reached"

    # Diminishing returns: three consecutive low-insight iterations
    if self.state["recent_insights"] < 2:
        self.state["diminishing_returns"] += 1
        if self.state["diminishing_returns"] >= 3:
            return True, "Diminishing returns"
    else:
        self.state["diminishing_returns"] = 0

    # Expert consensus
    if all(expert.satisfied() for expert in self.experts):
        return True, "Expert consensus achieved"

    return False, "Continue exploration"
```

## Implementation Considerations

### Scalability

For large databases (hundreds or thousands of tables):
- **Parallel Exploration**: Experts work simultaneously on different table subsets
- **Incremental Coverage**: Prioritize important tables (many relationships, high cardinality)
- **Smart Sampling**: Use statistical sampling instead of full scans for large tables
- **Progressive Refinement**: Start with an overview, drill down iteratively

### Performance

- **Caching**: Cache catalog queries to avoid repeated reads
- **Batch Operations**: Group multiple tool calls when possible
- **Index-Aware**: Let the Query Expert guide exploration to use indexed columns
- **Connection Pooling**: Reuse database connections (already implemented in MCP)

### Error Handling

- **Graceful Degradation**: If one expert fails, the others continue
- **Retry Logic**: Transient errors trigger retries with backoff
- **Partial Results**: The catalog stores partial findings if interrupted
- **Validation**: Experts cross-validate each other's findings

### Extensibility

- **Pluggable Experts**: New expert types can be added easily
- **Domain-Specific Experts**: Specialized experts for healthcare, finance, etc.
- **Custom Tools**: Additional MCP tools for specific analysis needs
- **Expert Configuration**: Experts can be configured/enabled based on needs

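
The "Incremental Coverage" idea — explore hub tables first — can be expressed as a one-line ranking. A sketch under assumed inputs (relationship and row counts per table; names are illustrative):

```python
def prioritize_tables(tables, rel_counts, row_counts):
    """Hub tables (many relationships) first; larger tables break ties."""
    return sorted(tables,
                  key=lambda t: (rel_counts.get(t, 0), row_counts.get(t, 0)),
                  reverse=True)

tables = ["customers", "orders", "audit_log", "products"]
rel_counts = {"orders": 3, "customers": 2, "products": 1, "audit_log": 0}
row_counts = {"audit_log": 40_000_000, "orders": 5_400_000,
              "customers": 125_000, "products": 18_000}
ranked = prioritize_tables(tables, rel_counts, row_counts)
```

Note that the huge but disconnected `audit_log` table ranks last: relationship count dominates raw size, so early iterations spend their budget where cross-expert findings compound.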
## Usage Example

```python
from discovery_agent import DiscoveryOrchestrator

# Initialize agent
agent = DiscoveryOrchestrator(
    mcp_endpoint="https://localhost:6071/mcp/query",
    auth_token="your_token"
)

# Run discovery
report = agent.discover(
    max_iterations=50,
    target_confidence=0.95
)

# Access findings
print(report["summary"])
print(report["domain"])
print(report["key_insights"])

# Query catalog for specific information
customers_analysis = agent.catalog.search("customers")
relationships = agent.catalog.get_kind("relationship")
```

## Related Documentation

- [Architecture.md](Architecture.md) - Overall MCP architecture
- [README.md](README.md) - Module overview and setup
- [VARIABLES.md](VARIABLES.md) - Configuration variables reference

## Version History

- **1.0** (2025-01-12) - Initial architecture design
- **Last Updated:** 2026-01-19

## Implementation Status

**Status:** Conceptual design - not implemented

**Actual Implementation:** See `Two_Phase_Discovery_Implementation.md` for the actual ProxySQL MCP discovery implementation.