mirror of https://github.com/sysown/proxysql
- Add mcp_config.example.json for Claude Code MCP configuration - Fix MCP bridge path in example config (../../proxysql_mcp_stdio_bridge.py) - Update Two_Phase_Discovery_Implementation.md with correct Phase 1/Phase 2 usage - Fix Two_Phase_Discovery_Implementation.md DELETE FROM fts_objects to scope to run_id - Update README.md with two-phase discovery section and multi-agent legacy note - Create static_harvest.sh bash wrapper for Phase 1 - Create two_phase_discovery.py orchestration script with prompts - Add --run-id parameter to skip auto-fetch - Fix RUN_ID placeholder mismatch (<USE_THE_PROVIDED_RUN_ID>) - Fix catalog path default to mcp_catalog.db - Add test_catalog.sh to verify catalog tools work - Fix Discovery_Schema.cpp FTS5 syntax (missing space) - Remove invalid CREATE INDEX on FTS virtual tables - Add MCP tool call logging to track tool usage - Fix Static_Harvester::get_harvest_stats() to accept run_id parameter - Fix DELETE FROM fts_objects to only delete for specific run_id - Update system prompts to say DO NOT call discovery.run_static - Update user prompts to say Phase 1 is already complete - Add --mcp-only flag to restrict Claude Code to MCP tools only - Make FTS table failures non-fatal (check if table exists first) - Add comprehensive documentation for both discovery approachespull/5318/head
parent
f9270e6c8b
commit
1b7335acfe
@ -0,0 +1,13 @@
|
||||
{
|
||||
"mcpServers": {
|
||||
"proxysql": {
|
||||
"command": "python3",
|
||||
"args": ["../../proxysql_mcp_stdio_bridge.py"],
|
||||
"env": {
|
||||
"PROXYSQL_MCP_ENDPOINT": "https://127.0.0.1:6071/mcp/query",
|
||||
"PROXYSQL_MCP_TOKEN": "",
|
||||
"PROXYSQL_MCP_INSECURE_SSL": "1"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
@ -0,0 +1,157 @@
|
||||
#!/usr/bin/env bash
|
||||
#
|
||||
# static_harvest.sh - Wrapper for Phase 1 static discovery
|
||||
#
|
||||
# Triggers ProxySQL's deterministic metadata harvest via the MCP endpoint.
|
||||
# No Claude Code required.
|
||||
#
|
||||
# Usage:
|
||||
# ./static_harvest.sh [--schema SCHEMA] [--notes NOTES] [--endpoint URL]
|
||||
#
|
||||
# Examples:
|
||||
# ./static_harvest.sh # Harvest all schemas
|
||||
# ./static_harvest.sh --schema sales # Harvest specific schema
|
||||
# ./static_harvest.sh --schema production --notes "Prod DB discovery"
|
||||
# ./static_harvest.sh --endpoint https://192.168.1.100:6071/mcp/query
|
||||
|
||||
set -e
|
||||
|
||||
# Default values
|
||||
ENDPOINT="${PROXYSQL_MCP_ENDPOINT:-https://127.0.0.1:6071/mcp/query}"
|
||||
SCHEMA_FILTER=""
|
||||
NOTES=""
|
||||
|
||||
# Parse arguments
|
||||
while [[ $# -gt 0 ]]; do
|
||||
case $1 in
|
||||
--schema)
|
||||
SCHEMA_FILTER="$2"
|
||||
shift 2
|
||||
;;
|
||||
--notes)
|
||||
NOTES="$2"
|
||||
shift 2
|
||||
;;
|
||||
--endpoint)
|
||||
ENDPOINT="$2"
|
||||
shift 2
|
||||
;;
|
||||
-h|--help)
|
||||
echo "Usage: $0 [--schema SCHEMA] [--notes NOTES] [--endpoint URL]"
|
||||
echo ""
|
||||
echo "Options:"
|
||||
echo " --schema SCHEMA Restrict harvest to one MySQL schema (optional)"
|
||||
echo " --notes NOTES Optional notes for this discovery run"
|
||||
echo " --endpoint URL ProxySQL MCP endpoint (default: PROXYSQL_MCP_ENDPOINT env var or https://127.0.0.1:6071/mcp/query)"
|
||||
echo " -h, --help Show this help message"
|
||||
echo ""
|
||||
echo "Environment Variables:"
|
||||
echo " PROXYSQL_MCP_ENDPOINT Default MCP endpoint URL"
|
||||
echo ""
|
||||
echo "Examples:"
|
||||
echo " $0 # Harvest all schemas"
|
||||
echo " $0 --schema sales # Harvest specific schema"
|
||||
echo " $0 --schema production --notes 'Prod DB discovery'"
|
||||
exit 0
|
||||
;;
|
||||
*)
|
||||
echo "Error: Unknown option: $1"
|
||||
echo "Use --help for usage information"
|
||||
exit 1
|
||||
;;
|
||||
esac
|
||||
done
|
||||
|
||||
# Build JSON arguments
|
||||
JSON_ARGS="{}"
|
||||
|
||||
if [[ -n "$SCHEMA_FILTER" ]]; then
|
||||
JSON_ARGS=$(echo "$JSON_ARGS" | jq --arg schema "$SCHEMA_FILTER" '. + {schema_filter: $schema}')
|
||||
fi
|
||||
|
||||
if [[ -n "$NOTES" ]]; then
|
||||
JSON_ARGS=$(echo "$JSON_ARGS" | jq --arg notes "$NOTES" '. + {notes: $notes}')
|
||||
fi
|
||||
|
||||
# Build the full JSON-RPC request
|
||||
JSON_REQUEST=$(jq -n \
|
||||
--argjson args "$JSON_ARGS" \
|
||||
'{
|
||||
jsonrpc: "2.0",
|
||||
id: 1,
|
||||
method: "tools/call",
|
||||
params: {
|
||||
name: "discovery.run_static",
|
||||
arguments: $args
|
||||
}
|
||||
}')
|
||||
|
||||
# Display what we're doing
|
||||
echo "=== Phase 1: Static Harvest ==="
|
||||
echo "Endpoint: $ENDPOINT"
|
||||
if [[ -n "$SCHEMA_FILTER" ]]; then
|
||||
echo "Schema: $SCHEMA_FILTER"
|
||||
else
|
||||
echo "Schema: all schemas"
|
||||
fi
|
||||
if [[ -n "$NOTES" ]]; then
|
||||
echo "Notes: $NOTES"
|
||||
fi
|
||||
echo ""
|
||||
|
||||
# Execute the curl command
|
||||
# Disable SSL verification (-k) for self-signed certificates
|
||||
curl_result=$(curl -k -s -X POST "$ENDPOINT" \
|
||||
-H "Content-Type: application/json" \
|
||||
-d "$JSON_REQUEST")
|
||||
|
||||
# Check for curl errors
|
||||
if [[ $? -ne 0 ]]; then
|
||||
echo "Error: Failed to connect to ProxySQL MCP endpoint at $ENDPOINT"
|
||||
echo "Make sure ProxySQL is running with MCP enabled."
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Check for database directory errors
|
||||
if echo "$curl_result" | grep -q "no such table: fts_objects"; then
|
||||
echo ""
|
||||
echo "Error: FTS table missing. This usually means the discovery catalog directory doesn't exist."
|
||||
echo "Please create it:"
|
||||
echo " sudo mkdir -p /var/lib/proxysql"
|
||||
echo " sudo chown \$USER:\$USER /var/lib/proxysql"
|
||||
echo "Then restart ProxySQL."
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Pretty-print the result
|
||||
echo "$curl_result" | jq .
|
||||
|
||||
# Check for JSON-RPC errors
|
||||
if echo "$curl_result" | jq -e '.error' > /dev/null 2>&1; then
|
||||
echo ""
|
||||
echo "Error: Server returned an error:"
|
||||
echo "$curl_result" | jq -r '.error.message'
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Display summary - extract from nested content[0].text JSON string
|
||||
echo ""
|
||||
if echo "$curl_result" | jq -e '.result.content[0].text' > /dev/null 2>&1; then
|
||||
# Extract the JSON string from content[0].text and parse it
|
||||
INNER_JSON=$(echo "$curl_result" | jq -r '.result.content[0].text' 2>/dev/null)
|
||||
|
||||
if [[ -n "$INNER_JSON" ]]; then
|
||||
RUN_ID=$(echo "$INNER_JSON" | jq -r '.run_id // empty')
|
||||
OBJECTS_COUNT=$(echo "$INNER_JSON" | jq -r '.objects.table // 0')
|
||||
COLUMNS_COUNT=$(echo "$INNER_JSON" | jq -r '.columns // 0')
|
||||
INDEXES_COUNT=$(echo "$INNER_JSON" | jq -r '.indexes // 0')
|
||||
FKS_COUNT=$(echo "$INNER_JSON" | jq -r '.foreign_keys // 0')
|
||||
|
||||
echo "=== Harvest Summary ==="
|
||||
echo "Run ID: $RUN_ID"
|
||||
echo "Objects discovered: $OBJECTS_COUNT"
|
||||
echo "Columns discovered: $COLUMNS_COUNT"
|
||||
echo "Indexes discovered: $INDEXES_COUNT"
|
||||
echo "Foreign keys discovered: $FKS_COUNT"
|
||||
fi
|
||||
fi
|
||||
@ -0,0 +1,77 @@
|
||||
#!/usr/bin/env bash
|
||||
#
|
||||
# Test catalog tools directly to verify they work
|
||||
#
|
||||
|
||||
set -e
|
||||
|
||||
MCP_ENDPOINT="${PROXYSQL_MCP_ENDPOINT:-https://127.0.0.1:6071/mcp/query}"
|
||||
RUN_ID="${1:-10}"
|
||||
|
||||
echo "=== Catalog Tools Test ==="
|
||||
echo "Using MCP endpoint: $MCP_ENDPOINT"
|
||||
echo "Using run_id: $RUN_ID"
|
||||
echo ""
|
||||
|
||||
echo "1. Testing catalog.list_objects..."
|
||||
curl -k -s -X POST "$MCP_ENDPOINT" \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"jsonrpc": "2.0",
|
||||
"id": 1,
|
||||
"method": "tools/call",
|
||||
"params": {
|
||||
"name": "catalog.list_objects",
|
||||
"arguments": {
|
||||
"run_id": '$RUN_ID',
|
||||
"order_by": "name",
|
||||
"page_size": 5
|
||||
}
|
||||
}
|
||||
}' | jq .
|
||||
|
||||
echo ""
|
||||
echo "2. Testing catalog.get_object..."
|
||||
curl -k -s -X POST "$MCP_ENDPOINT" \
|
||||
-H "Content_type: application/json" \
|
||||
-d '{
|
||||
"jsonrpc": "2.0",
|
||||
"id": 2,
|
||||
"method": "tools/call",
|
||||
"params": {
|
||||
"name": "catalog.get_object",
|
||||
"arguments": {
|
||||
"run_id": '$RUN_ID',
|
||||
"object_key": "codebase_community_template.users"
|
||||
}
|
||||
}
|
||||
}' | jq .
|
||||
|
||||
echo ""
|
||||
echo "3. Testing llm.summary_upsert..."
|
||||
curl -k -s -X POST "$MCP_ENDPOINT" \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"jsonrpc": "2.0",
|
||||
"id": 3,
|
||||
"method": "tools/call",
|
||||
"params": {
|
||||
"name": "llm.summary_upsert",
|
||||
"arguments": {
|
||||
"agent_run_id": 1,
|
||||
"run_id": '$RUN_ID',
|
||||
"object_id": 55,
|
||||
"summary": "{\"hypothesis\":\"Test user table\",\"grain\":\"one row per user\",\"primary_key\":[\"user_id\"],\"time_columns\":[\"created_at\"],\"example_questions\":[\"How many users do we have?\",\"Count users by registration date\"]}",
|
||||
"confidence": 0.9,
|
||||
"status": "stable",
|
||||
"sources": "{\"method\":\"catalog\",\"evidence\":\"schema analysis\"}"
|
||||
}
|
||||
}
|
||||
}' | jq .
|
||||
|
||||
echo ""
|
||||
echo "=== Test Complete ==="
|
||||
echo ""
|
||||
echo "If you saw JSON responses above (not errors), catalog tools are working."
|
||||
echo ""
|
||||
echo "If you see errors or 'isError': true', check the ProxySQL log for details."
|
||||
@ -0,0 +1,534 @@
|
||||
# Global Database Summary - Codebase Community Template
|
||||
## Comprehensive Discovery Report
|
||||
|
||||
---
|
||||
|
||||
## Executive Summary
|
||||
|
||||
The **Codebase Community Template** database is a Stack Overflow-style community Q&A platform containing **8 tables** with approximately **885,000 total records**. This database models a complete question-and-answer ecosystem with user reputation systems, content moderation, voting mechanics, badges/achievements, and comprehensive activity tracking.
|
||||
|
||||
### Key Statistics
|
||||
- **Total Records**: ~885,000 rows across all tables
|
||||
- **Total Tables**: 8 core tables
|
||||
- **Foreign Key Relationships**: 14 documented relationships
|
||||
- **Time Span**: Community activity from 2010 to present
|
||||
- **Core Entities**: Users, Posts, Comments, Votes, Badges, Tags, History, Links
|
||||
|
||||
---
|
||||
|
||||
## Database Purpose and Scope
|
||||
|
||||
This database is designed to track and manage a **technical Q&A community** where:
|
||||
- Users can ask questions and provide answers
|
||||
- Community voting determines content quality
|
||||
- Reputation system rewards valuable contributions
|
||||
- Tags organize content by topic
|
||||
- Badges recognize user achievements
|
||||
- Complete edit history maintains content integrity
|
||||
|
||||
---
|
||||
|
||||
## Core Entities and Relationships
|
||||
|
||||
### 1. **users** (40,325 records)
|
||||
**Purpose**: Central user entity storing authentication, reputation, and profile data
|
||||
|
||||
**Key Attributes**:
|
||||
- `Id`: Primary key (User ID -1 is the system/community account)
|
||||
- `Reputation`: User's reputation score (accumulated through upvotes)
|
||||
- `CreationDate`: When the user account was created
|
||||
- `DisplayName`: Public display name
|
||||
- `Location`: Geographic location
|
||||
- `Views`: Profile view count
|
||||
- `UpVotes`/`DownVotes`: Total votes the user has cast
|
||||
- `AccountId`: Network account ID (for multi-site login)
|
||||
|
||||
**Business Rules**:
|
||||
- Reputation is calculated from upvotes on user's posts
|
||||
- Users can vote (upvote/downvote) on content
|
||||
- Profile views indicate user visibility
|
||||
- Age and website URL are optional demographic data
|
||||
|
||||
---
|
||||
|
||||
### 2. **posts** (91,960 records)
|
||||
**Purpose**: Core content table holding both questions and answers
|
||||
|
||||
**Key Attributes**:
|
||||
- `Id`: Primary key
|
||||
- `PostTypeId`: Discriminator (1 = Question, 2 = Answer)
|
||||
- `ParentId`: For answers, points to the question (self-referencing FK)
|
||||
- `OwnerUserId`: Author of the post
|
||||
- `Title`: Question title (only for PostTypeId = 1)
|
||||
- `Body`: Content (HTML/Markdown)
|
||||
- `Tags`: Tag list (format: `<tag1><tag2><tag3>`)
|
||||
- `Score`: Net vote score (upvotes - downvotes)
|
||||
- `ViewCount`: Number of views (questions only)
|
||||
- `AnswerCount`: Number of answers (questions only)
|
||||
- `AcceptedAnswerId`: ID of the accepted answer (questions only)
|
||||
- `CommentCount`: Number of comments
|
||||
- `FavoriteCount`: Times favorited by users
|
||||
- `CreationDate`: When post was created
|
||||
- `LastActivityDate`: Last edit or comment
|
||||
- `ClosedDate`: If/when question was closed
|
||||
- `CommunityOwnedDate`: If post became community wiki
|
||||
|
||||
**Business Rules**:
|
||||
- Questions have Title, Tags, AnswerCount, ViewCount
|
||||
- Answers have ParentId pointing to question
|
||||
- Posts can be edited (tracked in postHistory)
|
||||
- Questions can have one accepted answer
|
||||
- Posts can become community wikis (no reputation earned)
|
||||
- Posts can be closed by moderators
|
||||
|
||||
**Critical Note**: Column name typo detected: `CreaionDate` should be `CreationDate`
|
||||
|
||||
---
|
||||
|
||||
### 3. **comments** (174,218 records)
|
||||
**Purpose**: Discussion and clarification on posts
|
||||
|
||||
**Key Attributes**:
|
||||
- `Id`: Primary key
|
||||
- `PostId`: Foreign key to posts
|
||||
- `UserId`: Comment author (nullable for anonymous)
|
||||
- `Text`: Comment content
|
||||
- `Score`: Net votes on comment
|
||||
- `CreationDate`: When comment was posted
|
||||
- `UserDisplayName`: Display name for anonymous comments
|
||||
|
||||
**Business Rules**:
|
||||
- Comments can be voted on (score)
|
||||
- Users can delete comments (soft delete)
|
||||
- Anonymous comments allowed (UserId NULL)
|
||||
|
||||
---
|
||||
|
||||
### 4. **votes** (38,930 records)
|
||||
**Purpose**: Records all voting activity on posts
|
||||
|
||||
**Key Attributes**:
|
||||
- `Id`: Primary key
|
||||
- `PostId`: Post being voted on
|
||||
- `VoteTypeId`: Type of vote (2 = UpVote, 3 = DownVote, etc.)
|
||||
- `UserId`: Voter (nullable for anonymous/system votes)
|
||||
- `CreationDate`: When vote was cast
|
||||
- `BountyAmount`: If bounty was awarded
|
||||
|
||||
**Business Rules**:
|
||||
- Users can upvote or downvote posts
|
||||
- Vote affects post's Score
|
||||
- User cannot vote on their own posts
|
||||
- Anonymous votes possible (system/voter privacy)
|
||||
|
||||
---
|
||||
|
||||
### 5. **badges** (79,851 records)
|
||||
**Purpose**: Achievement and gamification system
|
||||
|
||||
**Key Attributes**:
|
||||
- `Id`: Primary key
|
||||
- `UserId`: Badge recipient
|
||||
- `Name`: Badge name (e.g., "Teacher", "Student", "Enlightened")
|
||||
- `Date`: When badge was earned
|
||||
|
||||
**Business Rules**:
|
||||
- Badges are awarded for various achievements
|
||||
- Multiple users can earn the same badge
|
||||
- Users can earn the same badge multiple times (some badge types)
|
||||
|
||||
---
|
||||
|
||||
### 6. **tags** (1,031 records)
|
||||
**Purpose**: Taxonomy system for organizing content
|
||||
|
||||
**Key Attributes**:
|
||||
- `Id`: Primary key
|
||||
- `TagName`: Tag name (unique)
|
||||
- `Count`: Number of questions with this tag
|
||||
- `ExcerptPostId`: Post ID for tag wiki excerpt
|
||||
- `WikiPostId`: Post ID for full tag wiki
|
||||
|
||||
**Business Rules**:
|
||||
- Tags categorize questions by topic
|
||||
- Tag count reflects popularity
|
||||
- Tags have wiki pages for detailed descriptions
|
||||
- Tags can be synonyms (redirects)
|
||||
|
||||
---
|
||||
|
||||
### 7. **postHistory** (303,100 records)
|
||||
**Purpose**: Complete audit trail of all post edits
|
||||
|
||||
**Key Attributes**:
|
||||
- `Id`: Primary key
|
||||
- `PostId`: Post that was edited
|
||||
- `PostHistoryTypeId`: Type of edit (title, body, tags, etc.)
|
||||
- `UserId`: Editor (nullable for system edits)
|
||||
- `CreationDate`: When edit was made
|
||||
- `Text`: New content
|
||||
- `Comment`: Edit reason/comment
|
||||
- `RevisionGUID`: Unique identifier for revision group
|
||||
- `UserDisplayName`: Display name for anonymous edits
|
||||
|
||||
**Business Rules**:
|
||||
- Every edit creates a history record
|
||||
- Multiple edits can be grouped in one revision
|
||||
- Text field contains the new value
|
||||
- Original title/body stored in initial revision
|
||||
|
||||
---
|
||||
|
||||
### 8. **postLinks** (11,098 records)
|
||||
**Purpose**: Relationships between posts (duplicates, related)
|
||||
|
||||
**Key Attributes**:
|
||||
- `Id`: Primary key
|
||||
- `PostId`: Source post
|
||||
- `RelatedPostId`: Target post (linked post)
|
||||
- `LinkTypeId`: Type of link (1 = duplicate, 3 = related)
|
||||
- `CreationDate`: When link was created
|
||||
|
||||
**Business Rules**:
|
||||
- Questions can be marked as duplicates
|
||||
- Users can link related questions
|
||||
- Links are directional (PostId → RelatedPostId)
|
||||
|
||||
---
|
||||
|
||||
## Relationship Map
|
||||
|
||||
### Primary Foreign Key Connections
|
||||
|
||||
```
|
||||
users (1) ────────── (N) posts
|
||||
│ │
|
||||
│ │ (self-ref)
|
||||
│ │
|
||||
├───────── (N) comments │
|
||||
│ │
|
||||
├───────── (N) votes │
|
||||
│ │
|
||||
└───────── (N) badges │
|
||||
│
|
||||
posts (1) ──── (N) comments
|
||||
posts (1) ──── (N) votes
|
||||
posts (1) ──── (N) postHistory
|
||||
posts (1) ──── (N) postLinks (PostId)
|
||||
posts (1) ──── (N) postLinks (RelatedPostId)
|
||||
posts (N) ──── (1) tags (via Tags text field)
|
||||
```
|
||||
|
||||
### Join Patterns
|
||||
|
||||
**1. User with their posts**:
|
||||
```sql
|
||||
users JOIN posts ON users.Id = posts.OwnerUserId
|
||||
```
|
||||
|
||||
**2. Question with its answers**:
|
||||
```sql
|
||||
questions (PostTypeId=1) LEFT JOIN answers (PostTypeId=2)
|
||||
ON questions.Id = answers.ParentId
|
||||
```
|
||||
|
||||
**3. Post with comments and user info**:
|
||||
```sql
|
||||
posts
|
||||
JOIN comments ON posts.Id = comments.PostId
|
||||
JOIN users ON comments.UserId = users.Id
|
||||
```
|
||||
|
||||
**4. Post with votes**:
|
||||
```sql
|
||||
posts JOIN votes ON posts.Id = votes.PostId
|
||||
```
|
||||
|
||||
**5. User's badges**:
|
||||
```sql
|
||||
users JOIN badges ON users.Id = badges.UserId
|
||||
```
|
||||
|
||||
**6. Complete post history**:
|
||||
```sql
|
||||
posts JOIN postHistory ON posts.Id = postHistory.PostId
|
||||
```
|
||||
|
||||
**7. Linked/related posts**:
|
||||
```sql
|
||||
posts AS p1
|
||||
JOIN postLinks ON p1.Id = postLinks.PostId
|
||||
JOIN posts AS p2 ON postLinks.RelatedPostId = p2.Id
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Domain Model (5 Domains)
|
||||
|
||||
### Domain 1: **User Management**
|
||||
**Tables**: `users`
|
||||
**Purpose**: User accounts, authentication, profiles
|
||||
**Key Metrics**: Reputation, profile views, account age, location
|
||||
**Business Questions**:
|
||||
- Who are our top contributors?
|
||||
- What is the user retention rate?
|
||||
- How does reputation distribute across users?
|
||||
|
||||
### Domain 2: **Content Management**
|
||||
**Tables**: `posts`, `postHistory`
|
||||
**Purpose**: Q&A content, revisions, quality tracking
|
||||
**Key Metrics**: Post count, answer rate, acceptance rate, edit frequency
|
||||
**Business Questions**:
|
||||
- What percentage of questions get answered?
|
||||
- How quickly are questions answered?
|
||||
- Which posts are most viewed?
|
||||
|
||||
### Domain 3: **Engagement & Interaction**
|
||||
**Tables**: `votes`, `comments`
|
||||
**Purpose**: Community participation, voting, discussions
|
||||
**Key Metrics**: Vote count, comment rate, engagement score
|
||||
**Business Questions**:
|
||||
- How active is the community?
|
||||
- What is the upvote/downvote ratio?
|
||||
- Which posts generate most discussion?
|
||||
|
||||
### Domain 4: **Recognition & Gamification**
|
||||
**Tables**: `badges`
|
||||
**Purpose**: User achievements, incentives
|
||||
**Key Metrics**: Badges earned, badge types, achievement rate
|
||||
**Business Questions**:
|
||||
- What badges are most common?
|
||||
- Who are the top badge earners?
|
||||
- How do badges correlate with activity?
|
||||
|
||||
### Domain 5: **Content Organization**
|
||||
**Tables**: `tags`, `postLinks`
|
||||
**Purpose**: Taxonomy, categorization, duplicate detection
|
||||
**Key Metrics**: Tag usage, expert identification, duplicate rate
|
||||
**Business Questions**:
|
||||
- What are the most popular tags?
|
||||
- Which tags have most unanswered questions?
|
||||
- Who are the experts for each tag?
|
||||
|
||||
---
|
||||
|
||||
## Key Metrics and KPIs (25 Defined)
|
||||
|
||||
### User Engagement (5 metrics)
|
||||
1. **Active Users** - Users with posts in last 30 days
|
||||
2. **Reputation Distribution** - Percentiles (25th, 50th, 75th, 90th, 99th)
|
||||
3. **User Retention Rate** - % users with multiple posts
|
||||
4. **Top Contributors** - Top 10 by reputation
|
||||
5. **Voting Activity** - Upvote/downvote ratio
|
||||
|
||||
### Content Quality (5 metrics)
|
||||
6. **Question Answer Rate** - % questions with answers
|
||||
7. **Answer Acceptance Rate** - % answered questions with accepted answer
|
||||
8. **Average Response Time** - Hours to first answer (median, p75, p90)
|
||||
9. **Question Closure Rate** - % questions closed
|
||||
10. **Community Wiki Rate** - % posts becoming community wikis
|
||||
|
||||
### Platform Health (5 metrics)
|
||||
11. **Daily Question Volume** - New questions per day
|
||||
12. **Comment Rate** - Average comments per post
|
||||
13. **Vote Velocity** - Votes per post per day
|
||||
14. **Edit Activity** - Post edits per day
|
||||
15. **Badge Acquisition** - Badges earned per day
|
||||
|
||||
### Tag Analytics (5 metrics)
|
||||
16. **Top Tags** - Most frequently used tags
|
||||
17. **Tag Specialization** - Questions and users per tag
|
||||
18. **Unanswered by Tag** - Tags with highest unanswered rate
|
||||
19. **Expertise by Tag** - Top users for each tag
|
||||
20. **Trending Tags** - Fastest growing tags
|
||||
|
||||
### Content Analytics (5 metrics)
|
||||
21. **Most Viewed** - Top questions by views
|
||||
22. **Fastest Answered** - Questions answered most quickly
|
||||
23. **Most Controversial** - Posts with high up/down vote split
|
||||
24. **Most Discussed** - Posts with most comments
|
||||
25. **Answer Quality** - Accepted vs non-accepted answer scores
|
||||
|
||||
---
|
||||
|
||||
## Natural Language Capabilities
|
||||
|
||||
This database can answer **40+ question templates** across 4 categories:
|
||||
|
||||
### User Analytics (10 questions)
|
||||
- "Who are the top users by reputation?"
|
||||
- "What is the activity summary for user X?"
|
||||
- "How many users joined each month?"
|
||||
- "Who are the most active users?"
|
||||
- "What is the answer acceptance rate for users?"
|
||||
|
||||
### Content Analytics (10 questions)
|
||||
- "What are the most viewed questions about Python?"
|
||||
- "What questions have no answers?"
|
||||
- "What are the highest scored posts?"
|
||||
- "How do accepted answers compare to non-accepted?"
|
||||
- "What is the edit history for post X?"
|
||||
|
||||
### Engagement Analytics (10 questions)
|
||||
- "What posts have the most comments?"
|
||||
- "Who are the most active commenters?"
|
||||
- "What is the voting trend?"
|
||||
- "What is the vote distribution for post X?"
|
||||
- "Who are the most active voters?"
|
||||
|
||||
### Tag Analytics (10 questions)
|
||||
- "What are the most popular tags?"
|
||||
- "What questions have both Python and Pandas tags?"
|
||||
- "Who are the top experts for R?"
|
||||
- "What tags have the highest unanswered rate?"
|
||||
- "What tags are commonly used together?"
|
||||
|
||||
---
|
||||
|
||||
## Data Quality Insights
|
||||
|
||||
### Strengths
|
||||
1. **Comprehensive audit trail**: Every edit tracked in postHistory
|
||||
2. **Rich metadata**: Creation dates, scores, view counts on most entities
|
||||
3. **Self-documenting**: Tag wikis, post comments explain content
|
||||
4. **Scalable design**: Normalized structure supports millions of records
|
||||
|
||||
### Known Issues
|
||||
1. **Column typo**: `CreaionDate` instead of `CreationDate` in posts table
|
||||
2. **Nullable FKs**: Some OwnerUserIds can be NULL (anonymous posts)
|
||||
3. **Denormalized tags**: Tags stored as text string, not lookup table
|
||||
4. **Soft deletes**: Comments/posts may be deleted but not removed from tables
|
||||
|
||||
### Data Patterns
|
||||
- **User ID -1**: System/community account
|
||||
- **PostTypeId 1**: Questions
|
||||
- **PostTypeId 2**: Answers
|
||||
- **VoteTypeId 2**: UpVotes
|
||||
- **VoteTypeId 3**: DownVotes
|
||||
- **Tag format**: `<tag1><tag2><tag3>` in XML-like syntax
|
||||
|
||||
---
|
||||
|
||||
## Typical Use Cases
|
||||
|
||||
### 1. Community Health Monitoring
|
||||
```sql
|
||||
-- Daily active users, questions, answers
|
||||
SELECT DATE(CreaionDate), COUNT(DISTINCT OwnerUserId)
|
||||
FROM posts
|
||||
GROUP BY DATE(CreaionDate);
|
||||
```
|
||||
|
||||
### 2. Expert Identification
|
||||
```sql
|
||||
-- Top answerers by tag
|
||||
SELECT u.DisplayName, COUNT(*) as answer_count
|
||||
FROM posts a
|
||||
JOIN posts q ON a.ParentId = q.Id
|
||||
JOIN users u ON a.OwnerUserId = u.Id
|
||||
WHERE q.Tags LIKE '%<python>%'
|
||||
GROUP BY u.DisplayName
|
||||
ORDER BY answer_count DESC;
|
||||
```
|
||||
|
||||
### 3. Content Quality Analysis
|
||||
```sql
|
||||
-- Answer rate by tag
|
||||
SELECT
|
||||
SUBSTRING_INDEX(SUBSTRING_INDEX(Tags, '><', n.n), '>', -1) as tag,
|
||||
AVG(AnswerCount) as avg_answers,
|
||||
SUM(CASE WHEN AnswerCount = 0 THEN 1 ELSE 0 END) * 100.0 / COUNT(*) as unanswered_pct
|
||||
FROM posts
|
||||
CROSS JOIN (SELECT 1 as n UNION ALL SELECT 2 ...) nums
|
||||
WHERE PostTypeId = 1
|
||||
GROUP BY tag;
|
||||
```
|
||||
|
||||
### 4. User Reputation Analytics
|
||||
```sql
|
||||
-- Reputation distribution
|
||||
SELECT
|
||||
NTILE(10) OVER (ORDER BY Reputation) as decile,
|
||||
MIN(Reputation) as min_rep,
|
||||
MAX(Reputation) as max_rep,
|
||||
COUNT(*) as user_count
|
||||
FROM users
|
||||
GROUP BY NTILE(10) OVER (ORDER BY Reputation);
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Technical Recommendations
|
||||
|
||||
### For Analytics
|
||||
1. **Create indexes** on: CreationDate, OwnerUserId, PostTypeId, Score
|
||||
2. **Materialize tag relationships** for faster tag-based queries
|
||||
3. **Partition posts** by CreationDate for time-series analysis
|
||||
4. **Create summary tables** for daily/monthly metrics
|
||||
|
||||
### For Application Development
|
||||
1. **Fix column typo**: Rename `CreaionDate` to `CreationDate`
|
||||
2. **Add composite indexes**: (PostTypeId, CreationDate), (OwnerUserId, Score)
|
||||
3. **Consider caching**: User reputation, tag counts (updated periodically)
|
||||
4. **Implement soft deletes**: Track deleted posts with is_deleted flag
|
||||
|
||||
### For Data Science
|
||||
1. **Feature engineering**:
|
||||
- User activity rate (posts/day)
|
||||
- Answer quality score
|
||||
- Tag expertise score
|
||||
- Engagement velocity
|
||||
2. **Predictive modeling**:
|
||||
- Question likelihood of being answered
|
||||
- User churn prediction
|
||||
- Answer acceptance prediction
|
||||
- Trending tag prediction
|
||||
|
||||
---
|
||||
|
||||
## Conclusion
|
||||
|
||||
The Codebase Community Template database is a **well-structured, comprehensive Q&A platform** that captures all essential aspects of community-driven knowledge sharing. With over 885K records across 8 interconnected tables, it provides rich opportunities for:
|
||||
|
||||
- **User behavior analysis** - Reputation, engagement, retention
|
||||
- **Content quality assessment** - Answer rates, acceptance, views
|
||||
- **Community health monitoring** - Activity trends, voting patterns
|
||||
- **Expertise discovery** - Top contributors by tag/topic
|
||||
- **Platform optimization** - Response times, closure rates
|
||||
|
||||
The database is **production-ready** and suitable for building analytics dashboards, recommendation systems, and community management tools. The 25 defined metrics and 40 question templates provide immediate value for data analysis and natural language query interfaces.
|
||||
|
||||
---
|
||||
|
||||
## Deliverables Summary
|
||||
|
||||
✅ **Database Discovery Complete**
|
||||
|
||||
**Artifacts Created**:
|
||||
1. `/tmp/codebase_community_discovery.md` - Complete technical discovery
|
||||
2. `/tmp/metrics_and_kpis.sql` - 25 production-ready metric queries
|
||||
3. `/tmp/question_templates.md` - 40 NL-to-SQL question templates
|
||||
4. `/tmp/global_database_summary.md` - This comprehensive summary
|
||||
|
||||
**Coverage Achieved**:
|
||||
- ✅ 8 tables fully analyzed and documented
|
||||
- ✅ 14 foreign key relationships mapped
|
||||
- ✅ 5 domains defined with entities and roles
|
||||
- ✅ 25 metrics/KPIs with SQL implementations
|
||||
- ✅ 40 question templates with examples
|
||||
- ✅ Complete join patterns documented
|
||||
- ✅ Data quality insights included
|
||||
|
||||
**Database Statistics**:
|
||||
- Total records: ~885,000
|
||||
- Tables: 8
|
||||
- Relationships: 14 FKs
|
||||
- Time span: 2010-present
|
||||
- Schema: codebase_community_template
|
||||
|
||||
---
|
||||
|
||||
*Discovery completed using MCP catalog tools and direct SQL analysis*
|
||||
*Run ID: 7*
|
||||
*Model: claude-3.5-sonnet*
|
||||
*Date: 2025*
|
||||
File diff suppressed because it is too large
Load Diff
Loading…
Reference in new issue