# RAG Subsystem Doxygen Documentation
## Overview
The RAG (Retrieval-Augmented Generation) subsystem exposes keyword, semantic, and hybrid search, plus chunk and document retrieval, as tools served over MCP (Model Context Protocol). This document reproduces the Doxygen-style comments added to the RAG implementation and summarizes each tool's parameters.
## Main Classes
### RAG_Tool_Handler
The primary class. It implements all RAG functionality as MCP tools.
#### Class Definition
```cpp
class RAG_Tool_Handler : public MCP_Tool_Handler
```
#### Constructor
```cpp
/**
* @brief Constructor
* @param ai_mgr Pointer to AI_Features_Manager for database access and configuration
*
* Initializes the RAG tool handler with configuration parameters from GenAI_Thread
* if available, otherwise uses default values.
*
* Configuration parameters:
* - k_max: Maximum number of search results (default: 50)
* - candidates_max: Maximum number of candidates for hybrid search (default: 500)
* - query_max_bytes: Maximum query length in bytes (default: 8192)
* - response_max_bytes: Maximum response size in bytes (default: 5000000)
* - timeout_ms: Operation timeout in milliseconds (default: 2000)
*/
RAG_Tool_Handler(AI_Features_Manager* ai_mgr);
```
#### Public Methods
##### get_tool_list()
```cpp
/**
* @brief Get list of available RAG tools
* @return JSON object containing tool definitions and schemas
*
* Returns a comprehensive list of all available RAG tools with their
* input schemas and descriptions. Tools include:
* - rag.search_fts: Keyword search using FTS5
* - rag.search_vector: Semantic search using vector embeddings
* - rag.search_hybrid: Hybrid search combining FTS and vectors
* - rag.get_chunks: Fetch chunk content by chunk_id
* - rag.get_docs: Fetch document content by doc_id
* - rag.fetch_from_source: Refetch authoritative data from source
* - rag.admin.stats: Operational statistics
*/
json get_tool_list() override;
```
##### execute_tool()
```cpp
/**
* @brief Execute a RAG tool with arguments
* @param tool_name Name of the tool to execute
* @param arguments JSON object containing tool arguments
* @return JSON response with results or error information
*
* Executes the specified RAG tool with the provided arguments. Handles
* input validation, parameter processing, database queries, and result
* formatting according to MCP specifications.
*
* Supported tools:
* - rag.search_fts: Full-text search over documents
* - rag.search_vector: Vector similarity search
* - rag.search_hybrid: Hybrid search with two modes (fuse, fts_then_vec)
* - rag.get_chunks: Retrieve chunk content by ID
* - rag.get_docs: Retrieve document content by ID
* - rag.fetch_from_source: Refetch data from authoritative source
* - rag.admin.stats: Get operational statistics
*/
json execute_tool(const std::string& tool_name, const json& arguments) override;
```
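The dispatch step described above can be pictured as a lookup from tool name to handler. The following is a minimal sketch, not the actual `RAG_Tool_Handler` code: `ToolDispatcher`, `Payload`, and `ToolFn` are hypothetical names, and a plain `std::string` stands in for the `json` type so the example has no external dependencies.

```cpp
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for the JSON type used by the real handler.
typedef std::string Payload;
typedef std::function<Payload(const Payload&)> ToolFn;

// Illustrative dispatcher: finds the tool by name and runs it, or returns
// an error payload when the name is unknown.
class ToolDispatcher {
public:
    void register_tool(const std::string& name, ToolFn fn) {
        tools_[name] = std::move(fn);
    }

    Payload execute(const std::string& tool_name, const Payload& arguments) const {
        std::unordered_map<std::string, ToolFn>::const_iterator it = tools_.find(tool_name);
        if (it == tools_.end()) {
            // Note: tool_name is not JSON-escaped here; a real handler would
            // build the error with a JSON library.
            return "{\"error\":\"unknown tool: " + tool_name + "\"}";
        }
        return it->second(arguments);
    }

private:
    std::unordered_map<std::string, ToolFn> tools_;
};
```

Registering one handler per tool name keeps validation and formatting for each tool in its own function, while unknown names fail in a single place.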
#### Private Helper Methods
##### Database and Query Helpers
```cpp
/**
* @brief Execute database query and return results
* @param query SQL query string to execute
* @return SQLite3_result pointer or NULL on error
*
* Executes a SQL query against the vector database and returns the results.
* Handles error checking and logging. The caller is responsible for freeing
* the returned SQLite3_result.
*/
SQLite3_result* execute_query(const char* query);
/**
* @brief Validate and limit k parameter
* @param k Requested number of results
* @return Validated k value within configured limits
*
* Ensures the k parameter is within acceptable bounds (1 to k_max).
* Returns default value of 10 if k is invalid.
*/
int validate_k(int k);
/**
* @brief Validate and limit candidates parameter
* @param candidates Requested number of candidates
* @return Validated candidates value within configured limits
*
* Ensures the candidates parameter is within acceptable bounds (1 to candidates_max).
* Returns default value of 50 if candidates is invalid.
*/
int validate_candidates(int candidates);
/**
* @brief Validate query length
* @param query Query string to validate
* @return true if query is within length limits, false otherwise
*
* Checks if the query string length is within the configured query_max_bytes limit.
*/
bool validate_query_length(const std::string& query);
```
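As an illustration of the bounds described in these comments, the sketch below clamps `k` and `candidates` and checks the query length. It is written as free functions over a hypothetical `Limits` struct rather than as member functions, and it assumes that "invalid" means a value of zero or less and that oversized values are clamped to the maximum; the real implementation may differ on both points.

```cpp
#include <cstddef>
#include <string>

// Hypothetical container for the limits; values mirror the documented defaults.
struct Limits {
    int k_max;
    int candidates_max;
    std::size_t query_max_bytes;
    Limits() : k_max(50), candidates_max(500), query_max_bytes(8192) {}
};

// Assumption: k <= 0 is "invalid" and yields the default of 10;
// values above k_max are clamped down to k_max.
int validate_k(int k, const Limits& lim) {
    if (k <= 0) return 10;
    return k > lim.k_max ? lim.k_max : k;
}

// Same shape as validate_k, with the documented default of 50.
int validate_candidates(int candidates, const Limits& lim) {
    if (candidates <= 0) return 50;
    return candidates > lim.candidates_max ? lim.candidates_max : candidates;
}

// std::string::size() counts bytes, which matches a byte-based limit.
bool validate_query_length(const std::string& query, const Limits& lim) {
    return query.size() <= lim.query_max_bytes;
}
```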
##### JSON Parameter Extraction
```cpp
/**
* @brief Extract string parameter from JSON
* @param j JSON object to extract from
* @param key Parameter key to extract
* @param default_val Default value if key not found
* @return Extracted string value or default
*
* Safely extracts a string parameter from a JSON object, handling type
* conversion if necessary. Returns the default value if the key is not
* found or cannot be converted to a string.
*/
static std::string get_json_string(const json& j, const std::string& key,
                                   const std::string& default_val = "");
/**
* @brief Extract int parameter from JSON
* @param j JSON object to extract from
* @param key Parameter key to extract
* @param default_val Default value if key not found
* @return Extracted int value or default
*
* Safely extracts an integer parameter from a JSON object, handling type
* conversion from string if necessary. Returns the default value if the
* key is not found or cannot be converted to an integer.
*/
static int get_json_int(const json& j, const std::string& key, int default_val = 0);
/**
* @brief Extract bool parameter from JSON
* @param j JSON object to extract from
* @param key Parameter key to extract
* @param default_val Default value if key not found
* @return Extracted bool value or default
*
* Safely extracts a boolean parameter from a JSON object, handling type
* conversion from string or integer if necessary. Returns the default
* value if the key is not found or cannot be converted to a boolean.
*/
static bool get_json_bool(const json& j, const std::string& key, bool default_val = false);
/**
* @brief Extract string array from JSON
* @param j JSON object to extract from
* @param key Parameter key to extract
* @return Vector of extracted strings
*
* Safely extracts a string array parameter from a JSON object, filtering
* out non-string elements. Returns an empty vector if the key is not
* found or is not an array.
*/
static std::vector<std::string> get_json_string_array(const json& j, const std::string& key);
/**
* @brief Extract int array from JSON
* @param j JSON object to extract from
* @param key Parameter key to extract
* @return Vector of extracted integers
*
* Safely extracts an integer array parameter from a JSON object, handling
* type conversion from string if necessary. Returns an empty vector if
* the key is not found or is not an array.
*/
static std::vector<int> get_json_int_array(const json& j, const std::string& key);
```
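The "type conversion from string" mentioned in these comments could look like the sketch below. `parse_int_or` and `parse_bool_or` are hypothetical helpers, not part of the documented API, and the accepted boolean spellings are an assumption. They show only the fallback-to-default behaviour, using the standard library so the example does not depend on a JSON header.

```cpp
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <string>

// Parse the whole string as a base-10 int. Returns default_val when the
// string is empty, has trailing garbage, or does not fit in an int.
int parse_int_or(const std::string& s, int default_val) {
    if (s.empty()) return default_val;
    errno = 0;
    char* end = 0;
    long v = std::strtol(s.c_str(), &end, 10);
    if (errno == ERANGE || *end != '\0' || v < INT_MIN || v > INT_MAX) {
        return default_val;
    }
    return static_cast<int>(v);
}

// Assumption: only "true"/"1" and "false"/"0" are recognized.
bool parse_bool_or(const std::string& s, bool default_val) {
    if (s == "true" || s == "1") return true;
    if (s == "false" || s == "0") return false;
    return default_val;
}
```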
##### Scoring and Normalization
```cpp
/**
* @brief Compute Reciprocal Rank Fusion score
* @param rank Rank position (1-based)
* @param k0 Smoothing parameter
* @param weight Weight factor for this ranking
* @return RRF score
*
* Computes the Reciprocal Rank Fusion score for hybrid search ranking.
* Formula: weight / (k0 + rank)
*/
double compute_rrf_score(int rank, int k0, double weight);
/**
* @brief Normalize scores to 0-1 range (higher is better)
* @param score Raw score to normalize
* @param score_type Type of score being normalized
* @return Normalized score in 0-1 range
*
* Normalizes various types of scores to a consistent 0-1 range where
* higher values indicate better matches. Different score types may
* require different normalization approaches.
*/
double normalize_score(double score, const std::string& score_type);
```
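The RRF formula above is small enough to write out directly. This sketch is a free-function version of `compute_rrf_score` that follows the documented formula; the precondition check is an addition for illustration.

```cpp
#include <cassert>

// Formula from the comment above: weight / (k0 + rank), with a 1-based rank.
double compute_rrf_score(int rank, int k0, double weight) {
    assert(rank >= 1 && k0 >= 0);  // illustrative precondition
    return weight / static_cast<double>(k0 + rank);
}
```

With `k0 = 60` and `weight = 1.0`, rank 1 scores 1/61 and rank 2 scores 1/62, so a larger `k0` flattens the difference between neighbouring ranks.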
## Tool Specifications
### rag.search_fts
Keyword search over documents using FTS5.
#### Parameters
- `query` (string, required): Search query string
- `k` (integer): Number of results to return (default: 10; capped at `genai_rag_k_max`, which defaults to 50)
- `offset` (integer): Offset for pagination (default: 0)
- `filters` (object): Filter criteria for results
- `return` (object): Return options for result fields
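To make the interaction of `k` and `offset` concrete, here is a minimal in-memory sketch. The `paginate` helper is hypothetical. In SQL the same effect is normally achieved with `LIMIT k OFFSET offset`, and this document does not state which approach the handler uses.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Returns up to k items starting at position offset in a best-first list.
// An offset past the end yields an empty page.
std::vector<std::string> paginate(const std::vector<std::string>& ranked,
                                  std::size_t k, std::size_t offset) {
    if (offset >= ranked.size()) return std::vector<std::string>();
    std::size_t end = std::min(ranked.size(), offset + k);
    return std::vector<std::string>(
        ranked.begin() + static_cast<std::ptrdiff_t>(offset),
        ranked.begin() + static_cast<std::ptrdiff_t>(end));
}
```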
#### Filters
- `source_ids` (array of integers): Filter by source IDs
- `source_names` (array of strings): Filter by source names
- `doc_ids` (array of strings): Filter by document IDs
- `min_score` (number): Minimum score threshold
- `post_type_ids` (array of integers): Filter by post type IDs
- `tags_any` (array of strings): Filter by any of these tags
- `tags_all` (array of strings): Filter by all of these tags
- `created_after` (string): Filter by creation date (after)
- `created_before` (string): Filter by creation date (before)
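The difference between `tags_any` and `tags_all` is easiest to see in code. The helpers below are hypothetical and operate on in-memory tag lists, whereas the real handler presumably applies these filters in SQL. Treating an empty filter list as "no constraint" is an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

static bool has_tag(const std::vector<std::string>& tags, const std::string& t) {
    return std::find(tags.begin(), tags.end(), t) != tags.end();
}

// tags_any: the document passes if it carries at least one wanted tag.
// Assumption: an empty wanted list imposes no constraint.
bool matches_tags_any(const std::vector<std::string>& doc_tags,
                      const std::vector<std::string>& wanted) {
    if (wanted.empty()) return true;
    for (std::size_t i = 0; i < wanted.size(); ++i) {
        if (has_tag(doc_tags, wanted[i])) return true;
    }
    return false;
}

// tags_all: the document passes only if it carries every wanted tag.
bool matches_tags_all(const std::vector<std::string>& doc_tags,
                      const std::vector<std::string>& wanted) {
    for (std::size_t i = 0; i < wanted.size(); ++i) {
        if (!has_tag(doc_tags, wanted[i])) return false;
    }
    return true;
}
```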
#### Return Options
- `include_title` (boolean): Include title in results (default: true)
- `include_metadata` (boolean): Include metadata in results (default: true)
- `include_snippets` (boolean): Include snippets in results (default: false)
### rag.search_vector
Semantic search over documents using vector embeddings.
#### Parameters
- `query_text` (string, required): Text to search semantically
- `k` (integer): Number of results to return (default: 10; capped at `genai_rag_k_max`, which defaults to 50)
- `filters` (object): Filter criteria for results
- `embedding` (object): Embedding model specification
- `query_embedding` (object): Precomputed query embedding
- `return` (object): Return options for result fields
### rag.search_hybrid
Hybrid search combining FTS and vector search.
#### Parameters
- `query` (string, required): Search query for both FTS and vector
- `k` (integer): Number of results to return (default: 10; capped at `genai_rag_k_max`, which defaults to 50)
- `mode` (string): Search mode: 'fuse' or 'fts_then_vec'
- `filters` (object): Filter criteria for results
- `fuse` (object): Parameters for fuse mode
- `fts_then_vec` (object): Parameters for fts_then_vec mode
#### Fuse Mode Parameters
- `fts_k` (integer): Number of FTS results for fusion (default: 50)
- `vec_k` (integer): Number of vector results for fusion (default: 50)
- `rrf_k0` (integer): RRF smoothing parameter (default: 60)
- `w_fts` (number): Weight for FTS scores (default: 1.0)
- `w_vec` (number): Weight for vector scores (default: 1.0)
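One plausible reading of how the fuse-mode parameters combine is sketched below: each list contributes `weight / (rrf_k0 + rank)` per chunk, and chunks found by both searches accumulate both contributions. `FuseParams` and `fuse_rrf` are hypothetical names, and the tie-break by chunk ID is an assumption added to keep the output deterministic.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Defaults mirror the documented fuse-mode defaults.
struct FuseParams {
    int rrf_k0;
    double w_fts;
    double w_vec;
    FuseParams() : rrf_k0(60), w_fts(1.0), w_vec(1.0) {}
};

typedef std::pair<std::string, double> Scored;  // (chunk_id, fused score)

static bool better(const Scored& a, const Scored& b) {
    if (a.second != b.second) return a.second > b.second;
    return a.first < b.first;  // assumed tie-break for stable output
}

// Both inputs hold chunk IDs ordered best-first; ranks are 1-based.
std::vector<Scored> fuse_rrf(const std::vector<std::string>& fts_ranked,
                             const std::vector<std::string>& vec_ranked,
                             const FuseParams& p) {
    std::unordered_map<std::string, double> scores;
    for (std::size_t i = 0; i < fts_ranked.size(); ++i) {
        scores[fts_ranked[i]] += p.w_fts / (p.rrf_k0 + static_cast<double>(i + 1));
    }
    for (std::size_t i = 0; i < vec_ranked.size(); ++i) {
        scores[vec_ranked[i]] += p.w_vec / (p.rrf_k0 + static_cast<double>(i + 1));
    }
    std::vector<Scored> out(scores.begin(), scores.end());
    std::sort(out.begin(), out.end(), better);
    return out;
}
```

A chunk ranked second by FTS and first by vector search scores 1/62 + 1/61, which beats a chunk ranked first by only one of the two searches (1/61).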
#### FTS Then Vector Mode Parameters
- `candidates_k` (integer): FTS candidates to generate (default: 200)
- `rerank_k` (integer): Candidates to rerank with vector search (default: 50)
- `vec_metric` (string): Vector similarity metric (default: 'cosine')
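The `fts_then_vec` mode can be pictured as a rerank of FTS candidates by vector similarity. The sketch below assumes that `rerank_k` means "rerank only the first `rerank_k` FTS candidates" and that the `cosine` metric is plain cosine similarity computed in application code. `Candidate`, `cosine_similarity`, and `rerank` are hypothetical names, and the real implementation likely delegates the distance computation to the vector table.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Cosine similarity; returns 0 for mismatched sizes or zero-length vectors.
double cosine_similarity(const std::vector<float>& a, const std::vector<float>& b) {
    if (a.size() != b.size() || a.empty()) return 0.0;
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += static_cast<double>(a[i]) * b[i];
        na += static_cast<double>(a[i]) * a[i];
        nb += static_cast<double>(b[i]) * b[i];
    }
    if (na == 0.0 || nb == 0.0) return 0.0;
    return dot / (std::sqrt(na) * std::sqrt(nb));
}

struct Candidate {
    std::string chunk_id;
    std::vector<float> embedding;
};

typedef std::pair<std::string, double> Scored;  // (chunk_id, similarity)

static bool by_score_desc(const Scored& a, const Scored& b) {
    return a.second > b.second;
}

// Takes FTS candidates in FTS order, scores the first rerank_k of them
// against the query embedding, and returns them ordered by similarity.
std::vector<Scored> rerank(const std::vector<Candidate>& fts_candidates,
                           const std::vector<float>& query,
                           std::size_t rerank_k) {
    std::size_t n = std::min(rerank_k, fts_candidates.size());
    std::vector<Scored> out;
    out.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        out.push_back(Scored(fts_candidates[i].chunk_id,
                             cosine_similarity(query, fts_candidates[i].embedding)));
    }
    std::stable_sort(out.begin(), out.end(), by_score_desc);
    return out;
}
```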
### rag.get_chunks
Fetch chunk content by chunk_id.
#### Parameters
- `chunk_ids` (array of strings, required): List of chunk IDs to fetch
- `return` (object): Return options for result fields
### rag.get_docs
Fetch document content by doc_id.
#### Parameters
- `doc_ids` (array of strings, required): List of document IDs to fetch
- `return` (object): Return options for result fields
### rag.fetch_from_source
Refetch authoritative data from the source database.
#### Parameters
- `doc_ids` (array of strings, required): List of document IDs to refetch
- `columns` (array of strings): List of columns to fetch
- `limits` (object): Limits for the fetch operation
### rag.admin.stats
Get operational statistics for the RAG system.
#### Parameters
None
## Database Schema
The RAG subsystem uses the following tables in the vector database:
1. `rag_sources`: Ingestion configuration and source metadata
2. `rag_documents`: Canonical documents with stable IDs
3. `rag_chunks`: Chunked content for retrieval
4. `rag_fts_chunks`: FTS5 contentless index for keyword search
5. `rag_vec_chunks`: sqlite3-vec virtual table for vector similarity search
6. `rag_sync_state`: Sync state tracking for incremental ingestion
7. `rag_chunk_view`: Convenience view for debugging
## Security Features
1. **Input Validation**: Strict validation of all parameters and filters
2. **Query Limits**: Maximum limits on query length, result count, and candidates
3. **Timeouts**: Configurable operation timeouts to prevent resource exhaustion
4. **Column Whitelisting**: Strict column filtering for refetch operations
5. **Row and Byte Limits**: Maximum limits on returned data size
6. **Parameter Binding**: Safe parameter binding to prevent SQL injection
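Column whitelisting, as listed above, can be sketched as an allow-list check applied before any column name reaches a SQL string. `filter_columns` is a hypothetical helper, and whether the handler silently drops disallowed columns (as here) or rejects the whole request is not stated in this document.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Keeps only the requested columns that appear in the allow-list,
// preserving the order of the request. Matching is exact and case-sensitive.
std::vector<std::string> filter_columns(const std::vector<std::string>& requested,
                                        const std::set<std::string>& allowed) {
    std::vector<std::string> out;
    for (std::size_t i = 0; i < requested.size(); ++i) {
        if (allowed.count(requested[i]) > 0) {
            out.push_back(requested[i]);
        }
    }
    return out;
}
```

Because column names cannot be bound as SQL parameters, checking them against a fixed set complements parameter binding, which covers values only.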
## Performance Features
1. **Prepared Statements**: Efficient query execution with prepared statements
2. **Connection Management**: Proper database connection handling
3. **SQLite3-vec Integration**: Optimized vector operations
4. **FTS5 Integration**: Efficient full-text search capabilities
5. **Indexing Strategies**: Proper database indexing for performance
6. **Result Caching**: Efficient result processing and formatting
## Configuration Variables
1. `genai_rag_enabled`: Enable RAG features
2. `genai_rag_k_max`: Maximum k for search results (default: 50)
3. `genai_rag_candidates_max`: Maximum candidates for hybrid search (default: 500)
4. `genai_rag_query_max_bytes`: Maximum query length in bytes (default: 8192)
5. `genai_rag_response_max_bytes`: Maximum response size in bytes (default: 5000000)
6. `genai_rag_timeout_ms`: RAG operation timeout in ms (default: 2000)
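A timeout such as `genai_rag_timeout_ms` is commonly enforced by computing a deadline once and checking it between units of work. The `Deadline` class below is a hypothetical sketch of that pattern using a monotonic clock. It is not taken from the RAG implementation, which may enforce the timeout differently.

```cpp
#include <chrono>

// Captures "now + timeout_ms" at construction; expired() reports whether
// that point has been reached. steady_clock is monotonic, so wall-clock
// adjustments do not affect the result.
class Deadline {
public:
    explicit Deadline(int timeout_ms)
        : end_(std::chrono::steady_clock::now() +
               std::chrono::milliseconds(timeout_ms)) {}

    bool expired() const {
        return std::chrono::steady_clock::now() >= end_;
    }

private:
    std::chrono::steady_clock::time_point end_;
};
```

A long-running loop would construct `Deadline(2000)` for the default setting and stop with a timeout error once `expired()` returns true.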