mirror of https://github.com/sysown/proxysql
# RAG Subsystem Doxygen Documentation

## Overview

The RAG (Retrieval-Augmented Generation) subsystem provides a set of tools for semantic search and document retrieval through MCP (Model Context Protocol). This document reproduces the Doxygen-style comments added to the RAG implementation.

## Main Classes

### RAG_Tool_Handler

The primary class, which implements all RAG functionality exposed through MCP.

#### Class Definition
```cpp
class RAG_Tool_Handler : public MCP_Tool_Handler
```

#### Constructor
```cpp
/**
 * @brief Constructor
 * @param ai_mgr Pointer to AI_Features_Manager for database access and configuration
 *
 * Initializes the RAG tool handler with configuration parameters from GenAI_Thread
 * if available, otherwise uses default values.
 *
 * Configuration parameters:
 * - k_max: Maximum number of search results (default: 50)
 * - candidates_max: Maximum number of candidates for hybrid search (default: 500)
 * - query_max_bytes: Maximum query length in bytes (default: 8192)
 * - response_max_bytes: Maximum response size in bytes (default: 5000000)
 * - timeout_ms: Operation timeout in milliseconds (default: 2000)
 */
RAG_Tool_Handler(AI_Features_Manager* ai_mgr);
```

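
The fallback described in the constructor comment can be sketched as follows. This is a minimal illustration, not the handler's actual code: `RagLimits` and `resolve_limits` are hypothetical names, and the `std::optional` argument stands in for whatever GenAI_Thread exposes. Only the five parameter names and their defaults come from the comment above.

```cpp
#include <cstddef>
#include <optional>

// Hypothetical container for the five documented limits.
// Field names and defaults mirror the constructor comment.
struct RagLimits {
    int k_max = 50;
    int candidates_max = 500;
    std::size_t query_max_bytes = 8192;
    std::size_t response_max_bytes = 5000000;
    int timeout_ms = 2000;
};

// Use externally supplied limits when present, otherwise the defaults.
// `from_thread` is an assumption standing in for the GenAI_Thread lookup.
inline RagLimits resolve_limits(const std::optional<RagLimits>& from_thread) {
    return from_thread.value_or(RagLimits{});
}
```

Calling `resolve_limits(std::nullopt)` yields the documented defaults; passing a populated `RagLimits` returns it unchanged.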
#### Public Methods

##### get_tool_list()
```cpp
/**
 * @brief Get list of available RAG tools
 * @return JSON object containing tool definitions and schemas
 *
 * Returns a comprehensive list of all available RAG tools with their
 * input schemas and descriptions. Tools include:
 * - rag.search_fts: Keyword search using FTS5
 * - rag.search_vector: Semantic search using vector embeddings
 * - rag.search_hybrid: Hybrid search combining FTS and vectors
 * - rag.get_chunks: Fetch chunk content by chunk_id
 * - rag.get_docs: Fetch document content by doc_id
 * - rag.fetch_from_source: Refetch authoritative data from source
 * - rag.admin.stats: Operational statistics
 */
json get_tool_list() override;
```

##### execute_tool()
```cpp
/**
 * @brief Execute a RAG tool with arguments
 * @param tool_name Name of the tool to execute
 * @param arguments JSON object containing tool arguments
 * @return JSON response with results or error information
 *
 * Executes the specified RAG tool with the provided arguments. Handles
 * input validation, parameter processing, database queries, and result
 * formatting according to MCP specifications.
 *
 * Supported tools:
 * - rag.search_fts: Full-text search over documents
 * - rag.search_vector: Vector similarity search
 * - rag.search_hybrid: Hybrid search with two modes (fuse, fts_then_vec)
 * - rag.get_chunks: Retrieve chunk content by ID
 * - rag.get_docs: Retrieve document content by ID
 * - rag.fetch_from_source: Refetch data from authoritative source
 * - rag.admin.stats: Get operational statistics
 */
json execute_tool(const std::string& tool_name, const json& arguments) override;
```

#### Private Helper Methods

##### Database and Query Helpers

```cpp
/**
 * @brief Execute database query and return results
 * @param query SQL query string to execute
 * @return SQLite3_result pointer or NULL on error
 *
 * Executes a SQL query against the vector database and returns the results.
 * Handles error checking and logging. The caller is responsible for freeing
 * the returned SQLite3_result.
 */
SQLite3_result* execute_query(const char* query);

/**
 * @brief Validate and limit k parameter
 * @param k Requested number of results
 * @return Validated k value within configured limits
 *
 * Ensures the k parameter is within acceptable bounds (1 to k_max).
 * Returns default value of 10 if k is invalid.
 */
int validate_k(int k);

/**
 * @brief Validate and limit candidates parameter
 * @param candidates Requested number of candidates
 * @return Validated candidates value within configured limits
 *
 * Ensures the candidates parameter is within acceptable bounds (1 to candidates_max).
 * Returns default value of 50 if candidates is invalid.
 */
int validate_candidates(int candidates);

/**
 * @brief Validate query length
 * @param query Query string to validate
 * @return true if query is within length limits, false otherwise
 *
 * Checks if the query string length is within the configured query_max_bytes limit.
 */
bool validate_query_length(const std::string& query);
```

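
The validators above can be sketched as free functions. This is one plausible reading of the comments, not the project's implementation: it assumes non-positive requests fall back to the documented default, requests above the maximum are clamped to it, and the limits are passed as arguments rather than read from the handler. `clamp_or_default` is a hypothetical helper introduced only for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: values below 1 become `fallback`, values above
// `max_value` are clamped. The fallback itself never exceeds max_value.
inline int clamp_or_default(int value, int max_value, int fallback) {
    assert(max_value >= 1);
    if (value < 1) return std::min(fallback, max_value);
    return std::min(value, max_value);
}

// Documented default for k is 10; k_max defaults to 50.
inline int validate_k(int k, int k_max = 50) {
    return clamp_or_default(k, k_max, 10);
}

// Documented default for candidates is 50; candidates_max defaults to 500.
inline int validate_candidates(int candidates, int candidates_max = 500) {
    return clamp_or_default(candidates, candidates_max, 50);
}

// Length is measured in bytes, matching the query_max_bytes limit.
inline bool validate_query_length(const std::string& query,
                                  std::size_t query_max_bytes = 8192) {
    return query.size() <= query_max_bytes;
}
```

Under these assumptions, `validate_k(0)` yields 10 and `validate_k(1000)` yields 50. Whether the real method clamps or rejects oversized values should be checked against the source.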
##### JSON Parameter Extraction

```cpp
/**
 * @brief Extract string parameter from JSON
 * @param j JSON object to extract from
 * @param key Parameter key to extract
 * @param default_val Default value if key not found
 * @return Extracted string value or default
 *
 * Safely extracts a string parameter from a JSON object, handling type
 * conversion if necessary. Returns the default value if the key is not
 * found or cannot be converted to a string.
 */
static std::string get_json_string(const json& j, const std::string& key,
                                   const std::string& default_val = "");

/**
 * @brief Extract int parameter from JSON
 * @param j JSON object to extract from
 * @param key Parameter key to extract
 * @param default_val Default value if key not found
 * @return Extracted int value or default
 *
 * Safely extracts an integer parameter from a JSON object, handling type
 * conversion from string if necessary. Returns the default value if the
 * key is not found or cannot be converted to an integer.
 */
static int get_json_int(const json& j, const std::string& key, int default_val = 0);

/**
 * @brief Extract bool parameter from JSON
 * @param j JSON object to extract from
 * @param key Parameter key to extract
 * @param default_val Default value if key not found
 * @return Extracted bool value or default
 *
 * Safely extracts a boolean parameter from a JSON object, handling type
 * conversion from string or integer if necessary. Returns the default
 * value if the key is not found or cannot be converted to a boolean.
 */
static bool get_json_bool(const json& j, const std::string& key, bool default_val = false);

/**
 * @brief Extract string array from JSON
 * @param j JSON object to extract from
 * @param key Parameter key to extract
 * @return Vector of extracted strings
 *
 * Safely extracts a string array parameter from a JSON object, filtering
 * out non-string elements. Returns an empty vector if the key is not
 * found or is not an array.
 */
static std::vector<std::string> get_json_string_array(const json& j, const std::string& key);

/**
 * @brief Extract int array from JSON
 * @param j JSON object to extract from
 * @param key Parameter key to extract
 * @return Vector of extracted integers
 *
 * Safely extracts an integer array parameter from a JSON object, handling
 * type conversion from string if necessary. Returns an empty vector if
 * the key is not found or is not an array.
 */
static std::vector<int> get_json_int_array(const json& j, const std::string& key);
```

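
The lenient conversion rules described for `get_json_int` and `get_json_bool` can be illustrated without a JSON library. In the sketch below, `Scalar` is a hypothetical stand-in for a single JSON value and is not the `json` type used by the handler; the accepted boolean spellings are also an assumption, since the comments only say that strings and integers are converted.

```cpp
#include <charconv>
#include <string>
#include <system_error>
#include <variant>

// Stand-in for one JSON scalar: null, bool, integer, or string.
using Scalar = std::variant<std::monostate, bool, int, std::string>;

// Accept integers directly, parse strings such as "25" in full,
// and return default_val for anything else.
inline int scalar_to_int(const Scalar& v, int default_val = 0) {
    if (const int* n = std::get_if<int>(&v)) return *n;
    if (const std::string* s = std::get_if<std::string>(&v)) {
        int out = 0;
        const char* first = s->data();
        const char* last = first + s->size();
        auto [ptr, ec] = std::from_chars(first, last, out);
        if (ec == std::errc() && ptr == last) return out;
    }
    return default_val;
}

// Accept booleans, integers (non-zero is true), and the strings
// "true"/"1"/"false"/"0"; return default_val for anything else.
inline bool scalar_to_bool(const Scalar& v, bool default_val = false) {
    if (const bool* b = std::get_if<bool>(&v)) return *b;
    if (const int* n = std::get_if<int>(&v)) return *n != 0;
    if (const std::string* s = std::get_if<std::string>(&v)) {
        if (*s == "true" || *s == "1") return true;
        if (*s == "false" || *s == "0") return false;
    }
    return default_val;
}
```

The key property is that a malformed value never throws: a string such as `"25abc"` simply produces the caller's default.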
##### Scoring and Normalization

```cpp
/**
 * @brief Compute Reciprocal Rank Fusion score
 * @param rank Rank position (1-based)
 * @param k0 Smoothing parameter
 * @param weight Weight factor for this ranking
 * @return RRF score
 *
 * Computes the Reciprocal Rank Fusion score for hybrid search ranking.
 * Formula: weight / (k0 + rank)
 */
double compute_rrf_score(int rank, int k0, double weight);

/**
 * @brief Normalize scores to 0-1 range (higher is better)
 * @param score Raw score to normalize
 * @param score_type Type of score being normalized
 * @return Normalized score in 0-1 range
 *
 * Normalizes various types of scores to a consistent 0-1 range where
 * higher values indicate better matches. Different score types may
 * require different normalization approaches.
 */
double normalize_score(double score, const std::string& score_type);
```

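
The RRF formula in the comment above translates directly into code. The sketch below is a free-function version written for illustration; the precondition check is an addition of this example, not something the source documents.

```cpp
#include <cassert>

// Reciprocal Rank Fusion contribution of a single ranked list:
//   score = weight / (k0 + rank), where rank starts at 1.
inline double compute_rrf_score(int rank, int k0, double weight) {
    assert(rank >= 1 && k0 >= 0);
    return weight / static_cast<double>(k0 + rank);
}
```

With `k0 = 60` and `weight = 1.0`, rank 1 scores 1/61 (about 0.0164) and rank 2 scores 1/62 (about 0.0161). A larger `k0` flattens the difference between adjacent ranks, while the weight scales one ranking's influence relative to the other.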
## Tool Specifications

### rag.search_fts
Keyword search over documents using FTS5.

#### Parameters
- `query` (string, required): Search query string
- `k` (integer): Number of results to return (default: 10, max: 50)
- `offset` (integer): Offset for pagination (default: 0)
- `filters` (object): Filter criteria for results
- `return` (object): Return options for result fields

#### Filters
- `source_ids` (array of integers): Filter by source IDs
- `source_names` (array of strings): Filter by source names
- `doc_ids` (array of strings): Filter by document IDs
- `min_score` (number): Minimum score threshold
- `post_type_ids` (array of integers): Filter by post type IDs
- `tags_any` (array of strings): Filter by any of these tags
- `tags_all` (array of strings): Filter by all of these tags
- `created_after` (string): Filter by creation date (after)
- `created_before` (string): Filter by creation date (before)

#### Return Options
- `include_title` (boolean): Include title in results (default: true)
- `include_metadata` (boolean): Include metadata in results (default: true)
- `include_snippets` (boolean): Include snippets in results (default: false)

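
The difference between the `tags_any` and `tags_all` filters can be shown with a small predicate sketch. It assumes a document's tags are available as a list of strings and that an empty filter imposes no constraint; neither detail is stated in the source, and the real filtering is presumably done in SQL rather than in application code.

```cpp
#include <algorithm>
#include <string>
#include <vector>

using Tags = std::vector<std::string>;

inline bool has_tag(const Tags& doc_tags, const std::string& tag) {
    return std::find(doc_tags.begin(), doc_tags.end(), tag) != doc_tags.end();
}

// tags_any: at least one requested tag is present on the document.
// An empty filter matches everything (assumption).
inline bool matches_tags_any(const Tags& doc_tags, const Tags& wanted) {
    if (wanted.empty()) return true;
    return std::any_of(wanted.begin(), wanted.end(),
                       [&](const std::string& t) { return has_tag(doc_tags, t); });
}

// tags_all: every requested tag is present on the document.
inline bool matches_tags_all(const Tags& doc_tags, const Tags& wanted) {
    return std::all_of(wanted.begin(), wanted.end(),
                       [&](const std::string& t) { return has_tag(doc_tags, t); });
}
```

For a document tagged `mysql` and `replication`, a filter of `mysql, postgres` passes `tags_any` but fails `tags_all`.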
### rag.search_vector
Semantic search over documents using vector embeddings.

#### Parameters
- `query_text` (string, required): Text to search semantically
- `k` (integer): Number of results to return (default: 10, max: 50)
- `filters` (object): Filter criteria for results
- `embedding` (object): Embedding model specification
- `query_embedding` (object): Precomputed query embedding
- `return` (object): Return options for result fields

### rag.search_hybrid
Hybrid search combining FTS and vector search.

#### Parameters
- `query` (string, required): Search query for both FTS and vector
- `k` (integer): Number of results to return (default: 10, max: 50)
- `mode` (string): Search mode: 'fuse' or 'fts_then_vec'
- `filters` (object): Filter criteria for results
- `fuse` (object): Parameters for fuse mode
- `fts_then_vec` (object): Parameters for fts_then_vec mode

#### Fuse Mode Parameters
- `fts_k` (integer): Number of FTS results for fusion (default: 50)
- `vec_k` (integer): Number of vector results for fusion (default: 50)
- `rrf_k0` (integer): RRF smoothing parameter (default: 60)
- `w_fts` (number): Weight for FTS scores (default: 1.0)
- `w_vec` (number): Weight for vector scores (default: 1.0)

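
To show how the fuse parameters interact, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists of chunk IDs. `FuseParams`, `accumulate_rrf`, and `fuse_rrf` are hypothetical names for this example; only the parameter names and defaults come from the list above. Breaking ties by chunk ID is an assumption made so the output is deterministic.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Defaults taken from the fuse mode parameter list.
struct FuseParams {
    int fts_k = 50;
    int vec_k = 50;
    int rrf_k0 = 60;
    double w_fts = 1.0;
    double w_vec = 1.0;
};

using Ranked = std::vector<std::string>;  // chunk IDs, best match first

// Add weight / (k0 + rank) for the first `limit` entries of one list.
inline void accumulate_rrf(const Ranked& list, int limit, int k0, double weight,
                           std::unordered_map<std::string, double>& scores) {
    const std::size_t n =
        std::min(list.size(), static_cast<std::size_t>(std::max(limit, 0)));
    for (std::size_t i = 0; i < n; ++i) {
        const int rank = static_cast<int>(i) + 1;  // ranks are 1-based
        scores[list[i]] += weight / static_cast<double>(k0 + rank);
    }
}

// Fuse both lists; returns (chunk_id, score) pairs, highest score first.
inline std::vector<std::pair<std::string, double>> fuse_rrf(
        const Ranked& fts, const Ranked& vec, const FuseParams& p = {}) {
    assert(p.rrf_k0 >= 0);
    std::unordered_map<std::string, double> scores;
    accumulate_rrf(fts, p.fts_k, p.rrf_k0, p.w_fts, scores);
    accumulate_rrf(vec, p.vec_k, p.rrf_k0, p.w_vec, scores);
    std::vector<std::pair<std::string, double>> out(scores.begin(), scores.end());
    std::sort(out.begin(), out.end(), [](const auto& a, const auto& b) {
        if (a.second != b.second) return a.second > b.second;
        return a.first < b.first;  // deterministic tie-break (assumption)
    });
    return out;
}
```

A chunk that appears in both lists accumulates two contributions, so it tends to outrank a chunk that is first in only one list. Raising `w_vec` above `w_fts` shifts the balance toward semantic matches.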
#### FTS Then Vector Mode Parameters
- `candidates_k` (integer): FTS candidates to generate (default: 200)
- `rerank_k` (integer): Candidates to rerank with vector search (default: 50)
- `vec_metric` (string): Vector similarity metric (default: 'cosine')

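
The two-stage flow of `fts_then_vec` can be sketched as follows. This is an illustration under stated assumptions, not the handler's code: the FTS stage is assumed to have already produced candidates in rank order, each candidate carries its embedding in memory, `rerank_k` is interpreted as the size of the reranked output, and only the default `cosine` metric is shown. `Candidate` and `rerank_by_vector` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Embedding = std::vector<float>;

// Cosine similarity; returns 0 when either vector has zero norm.
inline double cosine_similarity(const Embedding& a, const Embedding& b) {
    assert(a.size() == b.size());
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += static_cast<double>(a[i]) * b[i];
        na += static_cast<double>(a[i]) * a[i];
        nb += static_cast<double>(b[i]) * b[i];
    }
    if (na == 0.0 || nb == 0.0) return 0.0;
    return dot / (std::sqrt(na) * std::sqrt(nb));
}

struct Candidate {
    std::string chunk_id;
    Embedding embedding;
};

// Keep the first candidates_k FTS hits, score each against the query
// embedding, and return the best rerank_k as (chunk_id, similarity).
inline std::vector<std::pair<std::string, double>> rerank_by_vector(
        const std::vector<Candidate>& fts_hits, const Embedding& query,
        std::size_t candidates_k = 200, std::size_t rerank_k = 50) {
    const std::size_t n = std::min(fts_hits.size(), candidates_k);
    std::vector<std::pair<std::string, double>> scored;
    scored.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        scored.emplace_back(fts_hits[i].chunk_id,
                            cosine_similarity(fts_hits[i].embedding, query));
    }
    // stable_sort keeps FTS order among equal similarities.
    std::stable_sort(scored.begin(), scored.end(),
                     [](const auto& a, const auto& b) { return a.second > b.second; });
    if (scored.size() > rerank_k) scored.resize(rerank_k);
    return scored;
}
```

Compared with fuse mode, this mode only considers chunks that matched the keyword query, so a semantically close chunk with no keyword overlap is never returned.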
### rag.get_chunks
Fetch chunk content by chunk_id.

#### Parameters
- `chunk_ids` (array of strings, required): List of chunk IDs to fetch
- `return` (object): Return options for result fields

### rag.get_docs
Fetch document content by doc_id.

#### Parameters
- `doc_ids` (array of strings, required): List of document IDs to fetch
- `return` (object): Return options for result fields

### rag.fetch_from_source
Refetch authoritative data from source database.

#### Parameters
- `doc_ids` (array of strings, required): List of document IDs to refetch
- `columns` (array of strings): List of columns to fetch
- `limits` (object): Limits for the fetch operation

### rag.admin.stats
Get operational statistics for RAG system.

#### Parameters
None

## Database Schema

The RAG subsystem uses the following tables in the vector database:

1. `rag_sources`: Ingestion configuration and source metadata
2. `rag_documents`: Canonical documents with stable IDs
3. `rag_chunks`: Chunked content for retrieval
4. `rag_fts_chunks`: FTS5 contentless index for keyword search
5. `rag_vec_chunks`: sqlite3-vec virtual table for vector similarity search
6. `rag_sync_state`: Sync state tracking for incremental ingestion
7. `rag_chunk_view`: Convenience view for debugging

## Security Features

1. **Input Validation**: Strict validation of all parameters and filters
2. **Query Limits**: Maximum limits on query length, result count, and candidates
3. **Timeouts**: Configurable operation timeouts to prevent resource exhaustion
4. **Column Whitelisting**: Strict column filtering for refetch operations
5. **Row and Byte Limits**: Maximum limits on returned data size
6. **Parameter Binding**: Safe parameter binding to prevent SQL injection

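
Column whitelisting for refetch operations can be sketched as a simple allow-list filter. The function name, the case-sensitive comparison, and the choice to silently drop unknown columns (rather than reject the whole request) are assumptions of this example; the source only states that columns are strictly filtered.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Keep only requested columns that appear in the allow-list,
// preserving the order in which they were requested.
inline std::vector<std::string> filter_columns(
        const std::vector<std::string>& requested,
        const std::unordered_set<std::string>& allowed) {
    std::vector<std::string> kept;
    for (const auto& col : requested) {
        if (allowed.count(col) != 0) kept.push_back(col);
    }
    return kept;
}
```

Because a column name must match an allow-list entry exactly, a value such as `Title; DROP TABLE t` never reaches the generated SQL. Column names cannot be bound as statement parameters, which is why an allow-list is used for them while values go through parameter binding.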
## Performance Features

1. **Prepared Statements**: Efficient query execution with prepared statements
2. **Connection Management**: Proper database connection handling
3. **SQLite3-vec Integration**: Optimized vector operations
4. **FTS5 Integration**: Efficient full-text search capabilities
5. **Indexing Strategies**: Proper database indexing for performance
6. **Result Caching**: Efficient result processing and formatting

## Configuration Variables

1. `genai_rag_enabled`: Enable RAG features
2. `genai_rag_k_max`: Maximum k for search results (default: 50)
3. `genai_rag_candidates_max`: Maximum candidates for hybrid search (default: 500)
4. `genai_rag_query_max_bytes`: Maximum query length in bytes (default: 8192)
5. `genai_rag_response_max_bytes`: Maximum response size in bytes (default: 5000000)
6. `genai_rag_timeout_ms`: RAG operation timeout in ms (default: 2000)