# StackExchange Posts Processor

A comprehensive script to extract, process, and index StackExchange posts for search capabilities.

## Features

- ✅ **Complete Pipeline**: Extracts parent posts and replies from the source database
- 📊 **Search Ready**: Creates full-text search indexes and processed text columns
- 🚀 **Efficient**: Batch processing with memory optimization
- 🔍 **Duplicate Prevention**: Skips already processed posts
- 📈 **Progress Tracking**: Real-time statistics and performance metrics
- 🔧 **Flexible**: Configurable source/target databases
- 📝 **Rich Output**: Structured JSON with tags and metadata

## Database Schema

The script creates a comprehensive target table with these columns:

```sql
processed_posts (
    PostId      BIGINT PRIMARY KEY,
    JsonData    JSON NOT NULL,        -- Complete post data
    Embeddings  BLOB NULL,            -- For future ML embeddings
    SearchText  LONGTEXT NULL,        -- Combined text for search
    TitleText   VARCHAR(1000) NULL,   -- Cleaned title
    BodyText    LONGTEXT NULL,        -- Cleaned body
    RepliesText LONGTEXT NULL,        -- Combined replies
    Tags        JSON NULL,            -- Extracted tags
    CreatedAt   TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    UpdatedAt   TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,

    -- Indexes
    KEY idx_created_at (CreatedAt),
    KEY idx_tags ((CAST(Tags AS CHAR(1000)))),  -- JSON tag index
    FULLTEXT INDEX ft_search (SearchText, TitleText, BodyText, RepliesText)
)
```

## Usage

### Basic Usage

```bash
# Process first 1000 posts
python3 stackexchange_posts.py --limit 1000

# Process with a custom batch size
python3 stackexchange_posts.py --limit 10000 --batch-size 500

# Don't skip duplicates (process all posts)
python3 stackexchange_posts.py --limit 1000 --no-skip-duplicates
```

### Advanced Configuration

```bash
# Custom database connections
python3 stackexchange_posts.py \
    --source-host 192.168.1.100 \
    --source-port 3307 \
    --source-user myuser \
    --source-password mypass \
    --source-db my_stackexchange \
    --target-host 192.168.1.200 \
    --target-port 3306 \
    --target-user search_user \
    --target-password search_pass \
    --target-db search_db \
    --limit 50000 \
    --batch-size 1000
```

## Search Examples

Once processed, you can search the data using:

### 1. MySQL Full-Text Search

```sql
-- Basic search (the MATCH() column list must match the ft_search index definition)
SELECT PostId, TitleText,
       MATCH(SearchText, TitleText, BodyText, RepliesText)
           AGAINST('mysql optimization' IN BOOLEAN MODE) AS relevance
FROM processed_posts
WHERE MATCH(SearchText, TitleText, BodyText, RepliesText)
          AGAINST('mysql optimization' IN BOOLEAN MODE)
ORDER BY relevance DESC;

-- Boolean search operators
SELECT PostId, TitleText
FROM processed_posts
WHERE MATCH(SearchText, TitleText, BodyText, RepliesText)
          AGAINST('+database -oracle' IN BOOLEAN MODE);

-- Proximity search (InnoDB uses the @distance operator)
SELECT PostId, TitleText
FROM processed_posts
WHERE MATCH(SearchText, TitleText, BodyText, RepliesText)
          AGAINST('"database performance" @5' IN BOOLEAN MODE);
```

### 2. Tag-based Search

```sql
-- Search by specific tags
SELECT PostId, TitleText
FROM processed_posts
WHERE JSON_CONTAINS(Tags, '"mysql"')
  AND JSON_CONTAINS(Tags, '"performance"');
```

### 3. Filtered Search

```sql
-- Search within a date range
SELECT PostId, TitleText,
       JSON_UNQUOTE(JSON_EXTRACT(JsonData, '$.CreationDate')) AS CreationDate
FROM processed_posts
WHERE MATCH(SearchText, TitleText, BodyText, RepliesText)
          AGAINST('python' IN BOOLEAN MODE)
  AND JSON_UNQUOTE(JSON_EXTRACT(JsonData, '$.CreationDate'))
          BETWEEN '2023-01-01' AND '2023-12-31';
```

## Performance Tips

1. **Batch Size**: Use larger batches (1000-5000) for better throughput
2. **Memory**: Adjust the batch size based on available memory
3. **Indexes**: The script automatically creates the necessary indexes
4. **Parallel Processing**: Consider running multiple instances with different offset ranges

## Output Example

```
🚀 StackExchange Posts Processor
==================================================
Source: 127.0.0.1:3306/stackexchange
Target: 127.0.0.1:3306/stackexchange_post
Limit: 1000 posts
Batch size: 100
Skip duplicates: True
==================================================
✅ Connected to source and target databases
✅ Target table created successfully with all search columns

🔄 Processing batch 1 - posts 1 to 100
⏭️ Skipping 23 duplicate posts
📝 Processing 77 posts...
📊 Batch inserted 77 posts
⏱️ Progress: 100/1000 posts (10.0%)
📈 Total processed: 77, Inserted: 77, Skipped: 23
⚡ Rate: 12.3 posts/sec

🎉 Processing complete!
📊 Total batches: 10
📝 Total processed: 800
✅ Total inserted: 800
⏭️ Total skipped: 200
⏱️ Total time: 45.2 seconds
🚀 Average rate: 17.7 posts/sec

✅ Processing completed successfully!
```

## Troubleshooting

### Common Issues

1. **Table Creation Failed**: Check database permissions
2. **Memory Issues**: Reduce the batch size
3. **Slow Performance**: Optimize the MySQL configuration
4. **Connection Errors**: Verify database credentials

### Maintenance

```sql
-- Check table status
SHOW TABLE STATUS LIKE 'processed_posts';

-- Rebuild the full-text index
ALTER TABLE processed_posts
    DROP INDEX ft_search,
    ADD FULLTEXT INDEX ft_search (SearchText, TitleText, BodyText, RepliesText);

-- Count processed posts
SELECT COUNT(*) FROM processed_posts;
```

## Requirements

- Python 3.7+
- mysql-connector-python
- MySQL 5.7+ (for JSON and full-text support); the functional `idx_tags` index additionally requires MySQL 8.0.13+

Install dependencies:

```bash
pip install mysql-connector-python
```

## Other Scripts

The `scripts/` directory also contains other utility scripts:

- `nlp_search_demo.py` - Demonstrates various search techniques on processed posts:
  - Full-text search with MySQL
  - Boolean search with operators
  - Tag-based JSON queries
  - Combined search approaches
  - Statistics and search analytics
  - Data preparation for future semantic search
- `add_mysql_user.sh` - Add/replace MySQL users in ProxySQL
- `change_host_status.sh` - Change host status in ProxySQL
- `flush_query_cache.sh` - Flush the ProxySQL query cache
- `kill_idle_backend_conns.py` - Kill idle backend connections
- `proxysql_config.sh` - Configure ProxySQL settings
- `stats_scrapper.py` - Scrape statistics from ProxySQL

## Using the NLP Search Demo

```bash
# Show search statistics
python3 nlp_search_demo.py --mode stats

# Full-text search
python3 nlp_search_demo.py --mode full-text --query "mysql performance optimization"

# Boolean search with operators
python3 nlp_search_demo.py --mode boolean --query "+database -oracle"

# Search by tags
python3 nlp_search_demo.py --mode tags --tags mysql performance --operator AND

# Combined search with text and tags
python3 nlp_search_demo.py --mode combined --query "python optimization" --tags python

# Prepare data for semantic search
python3 nlp_search_demo.py --mode similarity --query "machine learning"
```

## License

Internal use only.
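## Appendix: Row-Building Sketch

To make the schema above concrete, here is a minimal sketch of the kind of per-post processing the pipeline performs before inserting a `processed_posts` row. It is illustrative only: `clean_html`, `parse_tags`, and `build_row` are hypothetical names, not the script's actual internals, and the `<tag1><tag2>` tag format is assumed from the standard StackExchange data-dump convention.

```python
import json
import re
from html import unescape


def clean_html(raw):
    """Strip HTML tags and collapse whitespace (crude, dependency-free)."""
    text = re.sub(r"<[^>]+>", " ", unescape(raw or ""))
    return re.sub(r"\s+", " ", text).strip()


def parse_tags(raw):
    """StackExchange dumps store tags as '<mysql><performance>'."""
    return re.findall(r"<([^<>]+)>", raw or "")


def build_row(post, replies):
    """Assemble the column values for one processed_posts row (sketch)."""
    title = clean_html(post.get("Title", ""))
    body = clean_html(post.get("Body", ""))
    replies_text = " ".join(clean_html(r.get("Body", "")) for r in replies)
    return {
        "PostId": post["Id"],
        "JsonData": json.dumps({**post, "replies": replies}),
        "TitleText": title[:1000],  # TitleText is VARCHAR(1000)
        "BodyText": body,
        "RepliesText": replies_text,
        # SearchText feeds the ft_search full-text index
        "SearchText": " ".join(p for p in (title, body, replies_text) if p),
        "Tags": json.dumps(parse_tags(post.get("Tags", ""))),
    }


row = build_row(
    {"Id": 42, "Title": "Slow query", "Body": "<p>Why is <b>this</b> slow?</p>",
     "Tags": "<mysql><performance>"},
    [{"Body": "<p>Add an index.</p>"}],
)
print(row["SearchText"])        # Slow query Why is this slow? Add an index.
print(json.loads(row["Tags"]))  # ['mysql', 'performance']
```

A dict shaped like this maps directly onto a parameterized `INSERT ... ON DUPLICATE KEY UPDATE` against the target table, which is also how duplicate prevention per `PostId` would typically be enforced.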