# Database Question Capabilities Showcase ## Multi-Agent Discovery System This document showcases the comprehensive range of questions that can be answered based on the multi-agent database discovery performed via MCP server on the `testdb` e-commerce database. --- ## Overview The discovery was conducted by **4 collaborating subagents** across **4 rounds** of analysis: | Agent | Focus Area | |-------|-----------| | **Structural Agent** | Schema mapping, relationships, constraints, indexes | | **Statistical Agent** | Data distributions, patterns, anomalies, quality | | **Semantic Agent** | Business domain, entity types, production readiness | | **Query Agent** | Access patterns, optimization, performance analysis | --- ## Complete Question Taxonomy ### 1️⃣ Schema & Architecture Questions Questions about database structure, design, and implementation details. | Question Type | Example Questions | |--------------|-------------------| | **Table Structure** | "What columns does the `orders` table have?", "What are the data types for all customer fields?", "Show me the complete CREATE TABLE statement for products" | | **Relationships** | "What is the relationship between orders and customers?", "Which tables connect orders to products?", "Is this a one-to-many or many-to-many relationship?" | | **Index Analysis** | "Which indexes exist on the orders table?", "Why is there no composite index on (customer_id, order_date)?", "What indexes are missing?" | | **Missing Elements** | "What indexes are missing?", "Why are there no foreign key constraints?", "What would make this schema complete?" | | **Design Patterns** | "What design pattern was used for the order_items table?", "Is this a star schema or snowflake?", "Why use a junction table here?" | | **Constraint Analysis** | "What constraints are enforced at the database level?", "Why are there no CHECK constraints?", "What validation is missing?" | **I can answer:** Complete schema documentation, relationship diagrams, index recommendations, constraint analysis, design pattern explanations. --- ### 2️⃣ Data Content & Statistics Questions Questions about the actual data stored in the database. | Question Type | Example Questions | |--------------|-------------------| | **Cardinality** | "How many unique customers exist?", "What is the actual row count after deduplication?", "How many distinct values are in each column?" | | **Distributions** | "What is the distribution of order statuses?", "Which categories have the most products?", "Show me the value distribution of order totals" | | **Aggregations** | "What is the total revenue?", "What is the average order value?", "Which customer spent the most?", "What is the median order value?" | | **Ranges** | "What is the price range of products?", "What dates are covered by the orders?", "What is the min/max stock level?" | | **Top/Bottom N** | "Who are the top 3 customers by order count?", "Which product has the lowest stock?", "What are the 5 most expensive items?" | | **Correlations** | "Is there a correlation between product price and sales volume?", "Do customers who order expensive items tend to order more frequently?", "What is the correlation coefficient?" | | **Percentiles** | "What is the 90th percentile of order values?", "Which customers are in the top 10% by spend?" | **I can answer:** Exact counts, sums, averages, distributions, correlations, rankings, percentiles, statistical summaries. --- ### 3️⃣ Data Quality & Integrity Questions Questions about data health, accuracy, and anomalies. | Question Type | Example Questions | |--------------|-------------------| | **Duplication** | "Why are there 15 customers when only 5 are unique?", "Which records are duplicates?", "What is the duplication ratio?", "Identify all duplicate records" | | **Anomalies** | "Why are there orders from 2024 in a 2026 database?", "Why is every status exactly 33%?", "What temporal anomalies exist?" | | **Orphaned Records** | "Are there any orders pointing to non-existent customers?", "Do any order_items reference invalid products?", "Check referential integrity" | | **Validation** | "Is the email format consistent?", "Are there any negative prices or quantities?", "Validate data against business rules" | | **Statistical Tests** | "Does the order value distribution follow Benford's Law?", "Is the status distribution statistically uniform?", "What is the chi-square test result?" | | **Synthetic Detection** | "Is this real production data or synthetic test data?", "What evidence indicates this is synthetic data?", "Confidence level for synthetic classification" | | **Timeline Analysis** | "Why do orders predate their creation dates?", "What is the temporal impossibility?" | **I can answer:** Data quality scores, anomaly detection, statistical tests (chi-square, Benford's Law), duplication analysis, synthetic vs real data classification. --- ### 4️⃣ Performance & Optimization Questions Questions about query speed, indexing, and optimization. | Question Type | Example Questions | |--------------|-------------------| | **Query Analysis** | "Why is the customer order history query slow?", "What EXPLAIN output shows for this query?", "Analyze this query's performance" | | **Index Effectiveness** | "Which queries would benefit from a composite index?", "Why does the filesort happen?", "Are indexes being used?" | | **Performance Gains** | "How much faster will queries be after adding idx_customer_orderdate?", "What is the performance impact of deduplication?", "Quantify the improvement" | | **Bottlenecks** | "What is the slowest operation in the database?", "Where are the full table scans happening?", "Identify performance bottlenecks" | | **N+1 Patterns** | "Is there an N+1 query problem with order_items?", "Should I use JOIN or separate queries?", "Detect N+1 anti-patterns" | | **Optimization Priority** | "Which index should I add first?", "What gives the biggest performance improvement?", "Rank optimizations by impact" | | **Execution Plans** | "What is the EXPLAIN output for this query?", "What access type is being used?", "Why is it using ALL instead of index?" | **I can answer:** EXPLAIN plan analysis, index recommendations, performance projections (with numbers), bottleneck identification, N+1 pattern detection, optimization roadmaps. --- ### 5️⃣ Business & Domain Questions Questions about business meaning and operational capabilities. | Question Type | Example Questions | |--------------|-------------------| | **Domain Classification** | "What type of business is this database for?", "Is this e-commerce, healthcare, or finance?", "What industry does this serve?" | | **Entity Types** | "Which tables are fact tables vs dimension tables?", "What is the purpose of order_items?", "Classify each table by business function" | | **Business Rules** | "What is the order workflow?", "Does the system support returns or refunds?", "What business rules are enforced?" | | **Product Analysis** | "What is the product mix by category?", "Which product is the best seller?", "What is the price distribution?" | | **Customer Behavior** | "What is the customer retention rate?", "Which customers are most valuable?", "Describe customer purchasing patterns" | | **Business Insights** | "What is the average order value?", "What percentage of orders are pending vs completed?", "What are the key business metrics?" | | **Workflow Analysis** | "Can a customer cancel an order?", "How does order status transition work?", "What processes are supported?" | **I can answer:** Business domain classification, entity type classification, business rule documentation, workflow analysis, customer insights, product analysis. --- ### 6️⃣ Production Readiness & Maturity Questions Questions about deployment readiness and gaps. | Question Type | Example Questions | |--------------|-------------------| | **Readiness Score** | "How production-ready is this database?", "What percentage readiness does this system have?", "Can this go to production?" | | **Missing Features** | "What critical tables are missing?", "Can this system process payments?", "What functionality is absent?" | | **Capability Assessment** | "Can this system handle shipping?", "Is there inventory tracking?", "Can customers return items?", "What can't this system do?" | | **Gap Analysis** | "What is needed for production deployment?", "How long until this is production-ready?", "Create a gap analysis" | | **Risk Assessment** | "What are the risks of deploying this to production?", "What would break if we went live tomorrow?", "Assess production risks" | | **Maturity Level** | "Is this enterprise-grade or small business?", "What development stage is this in?", "Rate the system maturity" | | **Timeline Estimation** | "How many months to production readiness?", "What is the minimum viable timeline?" | **I can answer:** Production readiness percentage, gap analysis, risk assessment, timeline estimates (3-4 months minimum viable, 6-8 months full production), missing entity inventory. --- ### 7️⃣ Root Cause & Forensic Questions Questions about why problems exist and reconstructing events. | Question Type | Example Questions | |--------------|-------------------| | **Root Cause** | "Why is the data duplicated 3×?", "What caused the ETL to fail?", "What is the root cause of data quality issues?" | | **Timeline Analysis** | "When did the duplication happen?", "Why is there a 7.5 hour gap between batches?", "Reconstruct the event timeline" | | **Attribution** | "Who or what caused this issue?", "Was this a manual process or automated?", "What human actions led to this?" | | **Event Reconstruction** | "What sequence of events led to this state?", "Can you reconstruct the ETL failure scenario?", "What happened on 2026-01-11?" | | **Impact Tracing** | "How does the lack of FKs affect query performance?", "What downstream effects does duplication cause?", "Trace the impact chain" | | **Forensic Evidence** | "What timestamps prove this was manual intervention?", "Why do batch 2 and 3 have only 3 minutes between them?", "What is the smoking gun evidence?" | | **Causal Analysis** | "What caused the 3:1 duplication ratio?", "Why was INSERT used instead of MERGE?" | **I can answer:** Complete timeline reconstruction (16:07 → 23:44 → 23:48 on 2026-01-11), root cause identification (failed ETL with INSERT bug), forensic evidence analysis, causal chain documentation. --- ### 8️⃣ Remediation & Action Questions Questions about how to fix issues. | Question Type | Example Questions | |--------------|-------------------| | **Fix Priority** | "What should I fix first?", "Which issue is most critical?", "Prioritize the remediation steps" | | **SQL Generation** | "Write the SQL to deduplicate orders", "Generate the ALTER TABLE statements for FKs", "Create migration scripts" | | **Safety Checks** | "Is it safe to delete these duplicates?", "Will adding FKs break existing queries?", "What are the risks?" | | **Step-by-Step** | "What is the exact sequence to fix this database?", "Create a remediation plan", "Give me a 4-week roadmap" | | **Validation** | "How do I verify the deduplication worked?", "What tests should I run after adding indexes?", "Validate the fixes" | | **Rollback Plans** | "How do I undo the changes if something goes wrong?", "What is the rollback strategy?", "Create safety nets" | | **Implementation Guide** | "Provide ready-to-use SQL scripts", "What is the complete implementation guide?" | **I can answer:** Prioritized remediation plans (Priority 0-4), ready-to-use SQL scripts, safety validations, rollback strategies, 4-week implementation timeline. --- ### 9️⃣ Predictive & What-If Questions Questions about future states and hypothetical scenarios. | Question Type | Example Questions | |--------------|-------------------| | **Performance Projections** | "How much will storage shrink after deduplication?", "What will query time be after adding indexes?", "Project performance improvements" | | **Scenario Analysis** | "What happens if 1000 customers place orders simultaneously?", "Can this handle Black Friday traffic?", "Stress test scenarios" | | **Impact Forecasting** | "What is the business impact of not fixing this?", "How much revenue is being misreported?", "Forecast consequences" | | **Scaling Questions** | "When will we need to add more indexes?", "At what data volume will the current design fail?", "Scaling projections" | | **Growth Planning** | "How long before we need to partition tables?", "What will happen when we reach 1M orders?", "Growth capacity planning" | | **Cost-Benefit** | "Is it worth spending a week on deduplication?", "What is the ROI of adding these indexes?", "Business case analysis" | | **What-If Scenarios** | "What if we add a million customers?", "What if orders increase 10×?", "Hypothetical impact analysis" | **I can answer:** Performance projections (6-15× improvement), storage projections (67% reduction), scaling analysis, cost-benefit analysis, scenario modeling. --- ### 🔟 Comparative & Benchmarking Questions Questions comparing this database to others or standards. | Question Type | Example Questions | |--------------|-------------------| | **Before/After** | "How does the database compare before and after deduplication?", "What changed between Round 1 and Round 4?", "Show the evolution" | | **Best Practices** | "How does this schema compare to industry standards?", "Is this normal for an e-commerce database?", "Best practices comparison" | | **Tool Comparison** | "How would PostgreSQL handle this differently than MySQL?", "What if we used a document database?", "Cross-platform comparison" | | **Design Alternatives** | "Should we use a view or materialized view?", "Would a star schema be better than normalized?", "Alternative designs" | | **Version Differences** | "How does MySQL 8 compare to MySQL 5.7 for this workload?", "What would change with a different storage engine?", "Version impact analysis" | | **Competitive Analysis** | "How does our design compare to Shopify/WooCommerce?", "What are we doing differently than industry leaders?", "Competitive benchmarking" | | **Industry Standards** | "How does this compare to the Northwind schema?", "What would a database architect say about this?" | **I can answer:** Before/after comparisons, best practices assessment, alternative design proposals, industry standard comparisons, competitive analysis. --- ### 1️⃣1️⃣ Security & Compliance Questions Questions about data protection, access control, and regulatory compliance. | Question Type | Example Questions | |--------------|-------------------| | **Data Privacy** | "Is PII properly protected?", "Are customer emails stored securely?", "What personal data exists?" | | **Access Control** | "Who has access to what data?", "Are there any authentication mechanisms?", "Access security assessment" | | **Audit Trail** | "Can we track who changed what and when?", "Is there an audit log?", "Audit capability analysis" | | **Compliance** | "Does this meet GDPR requirements?", "Can we fulfill data deletion requests?", "Compliance assessment" | | **Injection Risks** | "Are there SQL injection vulnerabilities?", "Is input validation adequate?", "Security vulnerability scan" | | **Encryption** | "Is sensitive data encrypted at rest?", "Are passwords hashed?", "Encryption status" | | **Regulatory Requirements** | "What is needed for SOC 2 compliance?", "Does this meet PCI DSS requirements?" | **I can answer:** Security vulnerability assessment, compliance gap analysis (GDPR, SOC 2, PCI DSS), data privacy evaluation, audit capability analysis. --- ### 1️⃣2️⃣ Educational & Explanatory Questions Questions asking for explanations and learning. | Question Type | Example Questions | |--------------|-------------------| | **Concept Explanation** | "What is a foreign key and why does this database lack them?", "Explain the purpose of composite indexes", "What is a junction table?" | | **Why Questions** | "Why use a junction table?", "Why is there no CASCADE delete?", "Why are statuses strings not enums?", "Why did the architect choose this design?" | | **How It Works** | "How does the order_items table enable many-to-many relationships?", "How would you implement categories?", "Explain the data flow" | | **Trade-offs** | "What are the pros and cons of the current design?", "Why choose normalization vs denormalization?", "Design trade-off analysis" | | **Best Practice Teaching** | "What should have been done differently?", "Teach me proper e-commerce schema design", "Best practices for this domain" | | **Anti-Patterns** | "What are the database anti-patterns here?", "Why is this considered bad design?", "Anti-pattern identification" | | **Learning Path** | "What should a junior developer learn from this database?", "Create a curriculum based on this case study" | **I can answer:** Concept explanations (foreign keys, indexes, normalization), design rationale, trade-off analysis, best practices teaching, anti-pattern identification. --- ### 1️⃣3️⃣ Integration & Ecosystem Questions Questions about how this database fits with other systems. | Question Type | Example Questions | |--------------|-------------------| | **Application Fit** | "What application frameworks work best with this schema?", "How would an ORM map these tables?", "Framework compatibility" | | **API Design** | "What REST endpoints would this schema support?", "What GraphQL queries are possible?", "API design recommendations" | | **Data Pipeline** | "How would you ETL this to a data warehouse?", "Can this be exported to CSV/JSON/XML?", "Data pipeline design" | | **Analytics** | "How would you connect this to BI tools?", "What dashboards could be built?", "Analytics integration" | | **Search** | "How would you integrate Elasticsearch?", "Why is full-text search missing?", "Search integration" | | **Caching** | "What should be cached in Redis?", "Where would memcached help?", "Caching strategy" | | **Message Queues** | "How would Kafka/RabbitMQ integrate?", "What events should be published?" | **I can answer:** Framework recommendations (Django, Rails, Entity Framework), API endpoint design, ETL pipeline recommendations, BI tool integration, caching strategies. --- ### 1️⃣4️⃣ Advanced Multi-Agent Questions Questions about the discovery process itself and agent collaboration. | Question Type | Example Questions | |--------------|-------------------| | **Cross-Agent Synthesis** | "What do all 4 agents agree on?", "Where do agents disagree and why?", "Consensus analysis" | | **Confidence Assessment** | "How confident are you that this is synthetic data?", "What is the statistical confidence level?", "Confidence scoring" | | **Agent Collaboration** | "How did the structural agent validate the semantic agent's findings?", "What did the query agent learn from the statistical agent?", "Agent interaction analysis" | | **Round Evolution** | "How did understanding improve from Round 1 to Round 4?", "What new hypotheses emerged in later rounds?", "Discovery evolution" | | **Evidence Chain** | "What is the complete evidence chain for the ETL failure conclusion?", "How was the 3:1 duplication ratio confirmed?", "Evidence documentation" | | **Meta-Analysis** | "What would a 5th agent discover?", "Are there any blind spots in the multi-agent approach?", "Methodology critique" | | **Process Documentation** | "How was the multi-agent discovery orchestrated?", "What was the workflow?", "Process explanation" | **I can answer:** Cross-agent consensus analysis (95%+ agreement on critical findings), confidence assessments (99% synthetic data confidence), evidence chain documentation, methodology critique. --- ## Quick-Fire Example Questions Here are specific questions I can answer right now, organized by complexity: ### Simple Questions - "How many tables are in the database?" → 4 base tables + 1 view - "What is the primary key of customers?" → id (int) - "What indexes exist on orders?" → PRIMARY, idx_customer, idx_status - "How many unique products exist?" → 5 (after deduplication) - "What is the total actual revenue?" → $2,622.92 ### Medium Questions - "Why is there a 7.5 hour gap between data loads?" → Manual intervention (lunch break → evening session) - "What is the evidence this is synthetic data?" → Chi-square χ²=0, @example.com emails, perfect uniformity - "Which index should I add first?" → idx_customer_orderdate for customer queries - "Is it safe to delete duplicate customers?" → Yes, orders only reference IDs 1-4 - "What is the production readiness percentage?" → 5-30% ### Complex Questions - "Reconstruct the complete ETL failure scenario with timeline" → 3 batches at 16:07, 23:44, 23:48 on 2026-01-11 caused by INSERT bug instead of MERGE - "What is the statistical confidence this is synthetic data?" → 99.9% (p<0.001 for Benford's Law violation) - "Generate complete SQL migration to fix all issues" → Week-by-week scripts for deduplication, FKs, indexes, constraints - "What is the exact performance improvement after all optimizations?" → 6-15× overall improvement - "Create a 4-week remediation roadmap" → Week 1: deduplication, Week 2: FKs, Week 3: indexes, Week 4: unique constraints ### Expert Questions - "Cross-reference all agent findings and identify where consensus was achieved" → 95%+ agreement on 3:1 duplication, missing FKs, synthetic data, performance issues - "What would a PostgreSQL expert criticize about this MySQL design?" → Missing FK constraints, no CHECK constraints, no native enums for status, using timestamp vs timestamptz - "How does this compare to industry-standard e-commerce schemas like Magento/WooCommerce?" → Missing 80% of entities (payments, shipments, returns, addresses, inventory_transactions) - "What is the complete evidence chain proving manual ETL intervention?" → 7.5-hour gap (human activity), 3-minute retry (error correction), synchronous timestamps across tables - "Create a business case for remediation with ROI calculation" → $5,245.84 revenue overstatement prevented, 67% storage savings, 6-15× performance improvement --- ## Example Answers I Can Provide ### Example 1: Data Quality Question **Q:** "Prove this is synthetic data using statistical evidence" **A:** - **Chi-square test for status distribution:** χ²=0, p=1.0 (perfect uniformity - impossible in nature) - **Benford's Law violation:** All order totals start with "1" (p<0.001 for natural data) - **Price-volume correlation:** r=0.0 (should be r≈-0.5 for real sales data) - **Timeline impossibility:** Orders from 2024 in a 2026 database - **Email domains:** 100% @example.com (reserved for documentation/testing) - **Temporal forensics:** Three synchronous batches on 2026-01-11 **Conclusion:** 99.9% confidence this is synthetic test data. --- ### Example 2: Performance Question **Q:** "Why are customer order queries slow and how do I fix it?" **A:** - **Current query:** `SELECT * FROM orders WHERE customer_id = ? ORDER BY order_date DESC` - **EXPLAIN output:** `type: ref, key: idx_customer, Extra: Using filesort` - **Problem:** Single-column index `idx_customer` filters but cannot sort → filesort required - **Impact:** 30-40% slower query performance - **Solution:** `CREATE INDEX idx_customer_orderdate ON orders(customer_id, order_date DESC);` - **Expected improvement:** 10× faster (5ms → 0.5ms), eliminates filesort --- ### Example 3: Root Cause Question **Q:** "What caused the 3:1 data duplication?" **A:** - **Timeline reconstruction:** - Batch 1: 2026-01-11 16:07:29 (IDs 1-5) - Batch 2: 2026-01-11 23:44:54 (IDs 6-10) [+7h 37m gap] - Batch 3: 2026-01-11 23:48:04 (IDs 11-15) [+3m gap] - **Root cause:** ETL refresh script executed 3 times using INSERT instead of MERGE/UPSERT - **Evidence of manual intervention:** - 7.5-hour gap suggests human break (lunch → evening) - 3-minute gap suggests error correction or quick retry - Automated systems would show consistent intervals - **Why INSERT failed:** No UNIQUE constraints on natural keys (email, product name, order signature) allowed duplicates --- ### Example 4: Remediation Question **Q:** "Generate the complete SQL to fix this database" **A:** ```sql -- Week 1: Deduplication (Priority 0) DELETE t1 FROM orders t1 INNER JOIN orders t2 ON t1.customer_id = t2.customer_id AND t1.order_date = t2.order_date AND t1.total = t2.total AND t1.status = t2.status WHERE t1.id > t2.id; DELETE c1 FROM customers c1 INNER JOIN customers c2 ON c1.email = c2.email WHERE c1.id > c2.id; -- Week 2: Foreign Keys (Priority 1) ALTER TABLE orders ADD CONSTRAINT fk_orders_customer FOREIGN KEY (customer_id) REFERENCES customers(id); -- Week 3: Composite Indexes (Priority 2) CREATE INDEX idx_customer_orderdate ON orders(customer_id, order_date DESC); CREATE INDEX idx_status_orderdate ON orders(status, order_date DESC); -- Week 4: Unique Constraints (Priority 3) ALTER TABLE customers ADD CONSTRAINT uk_customers_email UNIQUE (email); ``` --- ## Summary The multi-agent discovery system can answer questions across **14 major categories** covering: - **Technical:** Schema, data, performance, security - **Business:** Domain, readiness, workflows, capabilities - **Analytical:** Quality, statistics, anomalies, patterns - **Operational:** Remediation, optimization, implementation - **Educational:** Explanations, best practices, learning - **Advanced:** Multi-agent synthesis, evidence chains, confidence assessment **Key Capability:** Integration across 4 specialized agents provides comprehensive answers that single-agent analysis cannot achieve, combining structural, statistical, semantic, and query perspectives into actionable insights. --- *For the complete database discovery report, see `DATABASE_DISCOVERY_REPORT.md`*