RAG Subsystem Doxygen Documentation

Overview

The RAG (Retrieval-Augmented Generation) subsystem exposes keyword, vector, and hybrid search, plus chunk and document retrieval, as tools over MCP (Model Context Protocol). This document reproduces the Doxygen-style comments added to the RAG implementation and summarizes the tool parameters, database schema, and configuration variables.

Main Classes

RAG_Tool_Handler

The primary class; it implements all RAG functionality over MCP.

Class Definition

class RAG_Tool_Handler : public MCP_Tool_Handler

Constructor

/**
 * @brief Constructor
 * @param ai_mgr Pointer to AI_Features_Manager for database access and configuration
 *
 * Initializes the RAG tool handler with configuration parameters from GenAI_Thread
 * if available, otherwise uses default values.
 *
 * Configuration parameters:
 * - k_max: Maximum number of search results (default: 50)
 * - candidates_max: Maximum number of candidates for hybrid search (default: 500)
 * - query_max_bytes: Maximum query length in bytes (default: 8192)
 * - response_max_bytes: Maximum response size in bytes (default: 5000000)
 * - timeout_ms: Operation timeout in milliseconds (default: 2000)
 */
RAG_Tool_Handler(AI_Features_Manager* ai_mgr);
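
The fallback described in the constructor comment can be sketched as follows. This is a minimal illustration, not the actual implementation: `RagLimits`, `GenAiSettings`, and `load_limits` are hypothetical names, and `GenAiSettings` only stands in for whatever GenAI_Thread really exposes. The default values are the ones listed in the comment above.

```cpp
#include <cstddef>

// Hypothetical holder for the limits named in the constructor comment.
// Defaults match the documented values.
struct RagLimits {
    int k_max = 50;
    int candidates_max = 500;
    std::size_t query_max_bytes = 8192;
    std::size_t response_max_bytes = 5000000;
    int timeout_ms = 2000;
};

// Stand-in for the configuration supplied by GenAI_Thread (assumed shape).
struct GenAiSettings {
    int rag_k_max = 50;
    int rag_candidates_max = 500;
    std::size_t rag_query_max_bytes = 8192;
    std::size_t rag_response_max_bytes = 5000000;
    int rag_timeout_ms = 2000;
};

// Use the configured values when a settings object is available,
// otherwise keep the documented defaults.
RagLimits load_limits(const GenAiSettings* settings) {
    RagLimits limits;  // starts with the documented defaults
    if (settings == nullptr) {
        return limits;
    }
    limits.k_max = settings->rag_k_max;
    limits.candidates_max = settings->rag_candidates_max;
    limits.query_max_bytes = settings->rag_query_max_bytes;
    limits.response_max_bytes = settings->rag_response_max_bytes;
    limits.timeout_ms = settings->rag_timeout_ms;
    return limits;
}
```

Calling `load_limits(nullptr)` yields the defaults; passing a settings object copies its values instead.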

Public Methods

get_tool_list()

/**
 * @brief Get list of available RAG tools
 * @return JSON object containing tool definitions and schemas
 *
 * Returns a comprehensive list of all available RAG tools with their
 * input schemas and descriptions. Tools include:
 * - rag.search_fts: Keyword search using FTS5
 * - rag.search_vector: Semantic search using vector embeddings
 * - rag.search_hybrid: Hybrid search combining FTS and vectors
 * - rag.get_chunks: Fetch chunk content by chunk_id
 * - rag.get_docs: Fetch document content by doc_id
 * - rag.fetch_from_source: Refetch authoritative data from source
 * - rag.admin.stats: Operational statistics
 */
json get_tool_list() override;

execute_tool()

/**
 * @brief Execute a RAG tool with arguments
 * @param tool_name Name of the tool to execute
 * @param arguments JSON object containing tool arguments
 * @return JSON response with results or error information
 *
 * Executes the specified RAG tool with the provided arguments. Handles
 * input validation, parameter processing, database queries, and result
 * formatting according to MCP specifications.
 *
 * Supported tools:
 * - rag.search_fts: Full-text search over documents
 * - rag.search_vector: Vector similarity search
 * - rag.search_hybrid: Hybrid search with two modes (fuse, fts_then_vec)
 * - rag.get_chunks: Retrieve chunk content by ID
 * - rag.get_docs: Retrieve document content by ID
 * - rag.fetch_from_source: Refetch data from authoritative source
 * - rag.admin.stats: Get operational statistics
 */
json execute_tool(const std::string& tool_name, const json& arguments) override;

Private Helper Methods

Database and Query Helpers

/**
 * @brief Execute database query and return results
 * @param query SQL query string to execute
 * @return SQLite3_result pointer or NULL on error
 *
 * Executes a SQL query against the vector database and returns the results.
 * Handles error checking and logging. The caller is responsible for freeing
 * the returned SQLite3_result.
 */
SQLite3_result* execute_query(const char* query);

/**
 * @brief Validate and limit k parameter
 * @param k Requested number of results
 * @return Validated k value within configured limits
 *
 * Ensures the k parameter is within acceptable bounds (1 to k_max).
 * Returns default value of 10 if k is invalid.
 */
int validate_k(int k);

/**
 * @brief Validate and limit candidates parameter
 * @param candidates Requested number of candidates
 * @return Validated candidates value within configured limits
 *
 * Ensures the candidates parameter is within acceptable bounds (1 to candidates_max).
 * Returns default value of 50 if candidates is invalid.
 */
int validate_candidates(int candidates);

/**
 * @brief Validate query length
 * @param query Query string to validate
 * @return true if query is within length limits, false otherwise
 *
 * Checks if the query string length is within the configured query_max_bytes limit.
 */
bool validate_query_length(const std::string& query);
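
The three validators above could behave roughly as in the sketch below. The comments do not say what happens when a value exceeds the configured maximum, so this sketch assumes that non-positive values fall back to the documented default (10 for k, 50 for candidates) and that oversized values are clamped to the maximum. The `_sketch` function names and the explicit limit parameters are illustrative; the real methods read the limits from the handler's configuration.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Assumption: k <= 0 is "invalid" and maps to the default of 10;
// k above k_max is clamped to k_max.
int validate_k_sketch(int k, int k_max = 50) {
    if (k <= 0) {
        return std::min(10, k_max);
    }
    return std::min(k, k_max);
}

// Same policy for candidates, with the documented default of 50.
int validate_candidates_sketch(int candidates, int candidates_max = 500) {
    if (candidates <= 0) {
        return std::min(50, candidates_max);
    }
    return std::min(candidates, candidates_max);
}

// std::string::size() counts bytes, which matches a limit expressed in bytes.
bool validate_query_length_sketch(const std::string& query,
                                  std::size_t query_max_bytes = 8192) {
    return query.size() <= query_max_bytes;
}
```

Under these assumptions a request for k = 1000 is served with k = 50, and a query of 8193 bytes is rejected.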

JSON Parameter Extraction

/**
 * @brief Extract string parameter from JSON
 * @param j JSON object to extract from
 * @param key Parameter key to extract
 * @param default_val Default value if key not found
 * @return Extracted string value or default
 *
 * Safely extracts a string parameter from a JSON object, handling type
 * conversion if necessary. Returns the default value if the key is not
 * found or cannot be converted to a string.
 */
static std::string get_json_string(const json& j, const std::string& key,
                                   const std::string& default_val = "");

/**
 * @brief Extract int parameter from JSON
 * @param j JSON object to extract from
 * @param key Parameter key to extract
 * @param default_val Default value if key not found
 * @return Extracted int value or default
 *
 * Safely extracts an integer parameter from a JSON object, handling type
 * conversion from string if necessary. Returns the default value if the
 * key is not found or cannot be converted to an integer.
 */
static int get_json_int(const json& j, const std::string& key, int default_val = 0);

/**
 * @brief Extract bool parameter from JSON
 * @param j JSON object to extract from
 * @param key Parameter key to extract
 * @param default_val Default value if key not found
 * @return Extracted bool value or default
 *
 * Safely extracts a boolean parameter from a JSON object, handling type
 * conversion from string or integer if necessary. Returns the default
 * value if the key is not found or cannot be converted to a boolean.
 */
static bool get_json_bool(const json& j, const std::string& key, bool default_val = false);

/**
 * @brief Extract string array from JSON
 * @param j JSON object to extract from
 * @param key Parameter key to extract
 * @return Vector of extracted strings
 *
 * Safely extracts a string array parameter from a JSON object, filtering
 * out non-string elements. Returns an empty vector if the key is not
 * found or is not an array.
 */
static std::vector<std::string> get_json_string_array(const json& j, const std::string& key);

/**
 * @brief Extract int array from JSON
 * @param j JSON object to extract from
 * @param key Parameter key to extract
 * @return Vector of extracted integers
 *
 * Safely extracts an integer array parameter from a JSON object, handling
 * type conversion from string if necessary. Returns an empty vector if
 * the key is not found or is not an array.
 */
static std::vector<int> get_json_int_array(const json& j, const std::string& key);
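
The lenient extraction described above (accept the native type, convert from a string where possible, otherwise return the default) can be illustrated without the real `json` type. The sketch below uses a `std::variant` as a stand-in for a JSON value so that it stays self-contained; `Value`, `Object`, `get_int_lenient`, and `get_bool_lenient` are hypothetical names, and the accepted boolean spellings ("true", "false", "1", "0") are an assumption. It requires C++17.

```cpp
#include <charconv>
#include <map>
#include <string>
#include <system_error>
#include <variant>

// Stand-in for a JSON value and object; not the handler's real json type.
using Value = std::variant<std::monostate, bool, int, std::string>;
using Object = std::map<std::string, Value>;

// Returns the integer stored under key, converting from a string if needed.
int get_int_lenient(const Object& obj, const std::string& key, int default_val = 0) {
    auto it = obj.find(key);
    if (it == obj.end()) return default_val;
    if (const int* i = std::get_if<int>(&it->second)) return *i;
    if (const std::string* s = std::get_if<std::string>(&it->second)) {
        int parsed = 0;
        const char* first = s->data();
        const char* last = first + s->size();
        auto [ptr, ec] = std::from_chars(first, last, parsed);
        // Accept only strings that are entirely a valid integer.
        if (ec == std::errc() && ptr == last) return parsed;
    }
    return default_val;
}

// Returns the boolean stored under key, converting from int or string if needed.
bool get_bool_lenient(const Object& obj, const std::string& key, bool default_val = false) {
    auto it = obj.find(key);
    if (it == obj.end()) return default_val;
    if (const bool* b = std::get_if<bool>(&it->second)) return *b;
    if (const int* i = std::get_if<int>(&it->second)) return *i != 0;
    if (const std::string* s = std::get_if<std::string>(&it->second)) {
        if (*s == "true" || *s == "1") return true;
        if (*s == "false" || *s == "0") return false;
    }
    return default_val;
}
```

With this shape, a client that sends `"k": "25"` as a string still gets k = 25, while `"k": "abc"` silently falls back to the caller's default.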

Scoring and Normalization

/**
 * @brief Compute Reciprocal Rank Fusion score
 * @param rank Rank position (1-based)
 * @param k0 Smoothing parameter
 * @param weight Weight factor for this ranking
 * @return RRF score
 *
 * Computes the Reciprocal Rank Fusion score for hybrid search ranking.
 * Formula: weight / (k0 + rank)
 */
double compute_rrf_score(int rank, int k0, double weight);

/**
 * @brief Normalize scores to 0-1 range (higher is better)
 * @param score Raw score to normalize
 * @param score_type Type of score being normalized
 * @return Normalized score in 0-1 range
 *
 * Normalizes various types of scores to a consistent 0-1 range where
 * higher values indicate better matches. Different score types may
 * require different normalization approaches.
 */
double normalize_score(double score, const std::string& score_type);

Tool Specifications

rag.search_fts

Keyword search over documents using FTS5.

Parameters

  • query (string, required): Search query string
  • k (integer): Number of results to return (default: 10, max: 50)
  • offset (integer): Offset for pagination (default: 0)
  • filters (object): Filter criteria for results
  • return (object): Return options for result fields

Filters

  • source_ids (array of integers): Filter by source IDs
  • source_names (array of strings): Filter by source names
  • doc_ids (array of strings): Filter by document IDs
  • min_score (number): Minimum score threshold
  • post_type_ids (array of integers): Filter by post type IDs
  • tags_any (array of strings): Filter by any of these tags
  • tags_all (array of strings): Filter by all of these tags
  • created_after (string): Filter by creation date (after)
  • created_before (string): Filter by creation date (before)

Return Options

  • include_title (boolean): Include title in results (default: true)
  • include_metadata (boolean): Include metadata in results (default: true)
  • include_snippets (boolean): Include snippets in results (default: false)

rag.search_vector

Semantic search over documents using vector embeddings.

Parameters

  • query_text (string, required): Text to search semantically
  • k (integer): Number of results to return (default: 10, max: 50)
  • filters (object): Filter criteria for results
  • embedding (object): Embedding model specification
  • query_embedding (object): Precomputed query embedding
  • return (object): Return options for result fields

rag.search_hybrid

Hybrid search combining FTS and vector search.

Parameters

  • query (string, required): Search query for both FTS and vector
  • k (integer): Number of results to return (default: 10, max: 50)
  • mode (string): Search mode: 'fuse' or 'fts_then_vec'
  • filters (object): Filter criteria for results
  • fuse (object): Parameters for fuse mode
  • fts_then_vec (object): Parameters for fts_then_vec mode

Fuse Mode Parameters

  • fts_k (integer): Number of FTS results for fusion (default: 50)
  • vec_k (integer): Number of vector results for fusion (default: 50)
  • rrf_k0 (integer): RRF smoothing parameter (default: 60)
  • w_fts (number): Weight for FTS scores (default: 1.0)
  • w_vec (number): Weight for vector scores (default: 1.0)
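
To show how these parameters interact, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists of chunk IDs. It applies the formula documented for compute_rrf_score, weight / (k0 + rank) with a 1-based rank, and sums the contributions from each list. `FuseParams`, `rrf_score`, and `fuse_rrf` are hypothetical names, and the alphabetical tie-break is an assumption added to keep the output deterministic.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Fuse-mode parameters with the documented defaults.
struct FuseParams {
    std::size_t fts_k = 50;
    std::size_t vec_k = 50;
    int rrf_k0 = 60;
    double w_fts = 1.0;
    double w_vec = 1.0;
};

// Documented formula: weight / (k0 + rank), where rank is 1-based.
double rrf_score(int rank, int k0, double weight) {
    return weight / static_cast<double>(k0 + rank);
}

// Fuse two ranked lists (best first) and return the top k (chunk_id, score) pairs.
std::vector<std::pair<std::string, double>> fuse_rrf(
        const std::vector<std::string>& fts_ranked,
        const std::vector<std::string>& vec_ranked,
        std::size_t k,
        const FuseParams& p = FuseParams{}) {
    std::unordered_map<std::string, double> scores;

    // Only the first fts_k / vec_k entries of each list take part in fusion.
    const std::size_t n_fts = std::min(p.fts_k, fts_ranked.size());
    for (std::size_t i = 0; i < n_fts; ++i) {
        scores[fts_ranked[i]] += rrf_score(static_cast<int>(i) + 1, p.rrf_k0, p.w_fts);
    }
    const std::size_t n_vec = std::min(p.vec_k, vec_ranked.size());
    for (std::size_t i = 0; i < n_vec; ++i) {
        scores[vec_ranked[i]] += rrf_score(static_cast<int>(i) + 1, p.rrf_k0, p.w_vec);
    }

    std::vector<std::pair<std::string, double>> out(scores.begin(), scores.end());
    std::sort(out.begin(), out.end(), [](const auto& a, const auto& b) {
        if (a.second != b.second) return a.second > b.second;  // higher score first
        return a.first < b.first;                              // assumed tie-break
    });
    if (out.size() > k) out.resize(k);
    return out;
}
```

With the defaults, a chunk ranked 1st by FTS and 2nd by vector search scores 1/61 + 1/62, which beats a chunk ranked 3rd and 1st (1/63 + 1/61). Chunks found by only one method still appear, but with a single contribution.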

FTS Then Vector Mode Parameters

  • candidates_k (integer): FTS candidates to generate (default: 200)
  • rerank_k (integer): Candidates to rerank with vector search (default: 50)
  • vec_metric (string): Vector similarity metric (default: 'cosine')
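
The second stage of fts_then_vec mode can be sketched as a rerank of FTS candidates by cosine similarity. The sketch assumes stage one has already produced up to candidates_k chunks in FTS order, each with its embedding loaded, and that only the first rerank_k of them are rescored; this reading of the two parameters is an assumption. `Candidate`, `cosine_similarity`, and `rerank_by_cosine` are hypothetical names.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// An FTS candidate with its stored embedding (assumed shape).
struct Candidate {
    std::string chunk_id;
    std::vector<float> embedding;
};

// Cosine similarity; returns 0 for mismatched, empty, or zero-length vectors.
double cosine_similarity(const std::vector<float>& a, const std::vector<float>& b) {
    if (a.size() != b.size() || a.empty()) return 0.0;
    double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += static_cast<double>(a[i]) * b[i];
        norm_a += static_cast<double>(a[i]) * a[i];
        norm_b += static_cast<double>(b[i]) * b[i];
    }
    if (norm_a == 0.0 || norm_b == 0.0) return 0.0;
    return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
}

// Rescore the first rerank_k FTS candidates against the query embedding
// and return the top k (chunk_id, similarity) pairs, best first.
std::vector<std::pair<std::string, double>> rerank_by_cosine(
        const std::vector<Candidate>& fts_candidates,
        const std::vector<float>& query_embedding,
        std::size_t rerank_k = 50,
        std::size_t k = 10) {
    const std::size_t pool = std::min(rerank_k, fts_candidates.size());
    std::vector<std::pair<std::string, double>> scored;
    scored.reserve(pool);
    for (std::size_t i = 0; i < pool; ++i) {
        scored.emplace_back(fts_candidates[i].chunk_id,
                            cosine_similarity(fts_candidates[i].embedding, query_embedding));
    }
    // stable_sort keeps the original FTS order among equal similarities.
    std::stable_sort(scored.begin(), scored.end(),
                     [](const auto& a, const auto& b) { return a.second > b.second; });
    if (scored.size() > k) scored.resize(k);
    return scored;
}
```

Compared with fuse mode, this approach never returns a chunk that keyword search missed: the vector stage only reorders what FTS already found.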

rag.get_chunks

Fetch chunk content by chunk_id.

Parameters

  • chunk_ids (array of strings, required): List of chunk IDs to fetch
  • return (object): Return options for result fields

rag.get_docs

Fetch document content by doc_id.

Parameters

  • doc_ids (array of strings, required): List of document IDs to fetch
  • return (object): Return options for result fields

rag.fetch_from_source

Refetch authoritative data from the source database.

Parameters

  • doc_ids (array of strings, required): List of document IDs to refetch
  • columns (array of strings): List of columns to fetch
  • limits (object): Limits for the fetch operation

rag.admin.stats

Get operational statistics for the RAG system.

Parameters

None

Database Schema

The RAG subsystem uses the following tables in the vector database:

  1. rag_sources: Ingestion configuration and source metadata
  2. rag_documents: Canonical documents with stable IDs
  3. rag_chunks: Chunked content for retrieval
  4. rag_fts_chunks: FTS5 contentless index for keyword search
  5. rag_vec_chunks: sqlite3-vec virtual table for vector similarity search
  6. rag_sync_state: Sync state tracking for incremental ingestion
  7. rag_chunk_view: Convenience view for debugging

Security Features

  1. Input Validation: Strict validation of all parameters and filters
  2. Query Limits: Maximum limits on query length, result count, and candidates
  3. Timeouts: Configurable operation timeouts to prevent resource exhaustion
  4. Column Whitelisting: Strict column filtering for refetch operations
  5. Row and Byte Limits: Maximum limits on returned data size
  6. Parameter Binding: Safe parameter binding to prevent SQL injection
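
As an illustration of column whitelisting for rag.fetch_from_source, the sketch below keeps only the requested columns that appear in an allow-list, so that arbitrary client input never reaches the generated SQL as an identifier. The function name and the example column names are hypothetical; how the real handler obtains its allow-list is not described in this document.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Return the requested columns that are on the allow-list, preserving
// request order and dropping duplicates. Anything else is discarded.
std::vector<std::string> filter_columns(
        const std::vector<std::string>& requested,
        const std::unordered_set<std::string>& allowed) {
    std::vector<std::string> result;
    std::unordered_set<std::string> seen;
    for (const std::string& column : requested) {
        // Exact match only: no trimming, case folding, or pattern matching.
        if (allowed.count(column) != 0 && seen.insert(column).second) {
            result.push_back(column);
        }
    }
    return result;
}
```

A request for `Title`, `Body; DROP TABLE posts`, and `Id` against an allow-list of `Id` and `Title` yields only `Title` and `Id`. Values, as opposed to identifiers, would still go through parameter binding.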

Performance Features

  1. Prepared Statements: Efficient query execution with prepared statements
  2. Connection Management: Proper database connection handling
  3. sqlite3-vec Integration: Optimized vector operations
  4. FTS5 Integration: Efficient full-text search capabilities
  5. Indexing Strategies: Proper database indexing for performance
  6. Result Caching: Efficient result processing and formatting

Configuration Variables

  1. genai_rag_enabled: Enable RAG features
  2. genai_rag_k_max: Maximum k for search results (default: 50)
  3. genai_rag_candidates_max: Maximum candidates for hybrid search (default: 500)
  4. genai_rag_query_max_bytes: Maximum query length in bytes (default: 8192)
  5. genai_rag_response_max_bytes: Maximum response size in bytes (default: 5000000)
  6. genai_rag_timeout_ms: RAG operation timeout in ms (default: 2000)