
RAG Source Configuration Guide: chunking_json and embedding_json

This guide explains how to configure document chunking and vector embedding generation in the ProxySQL RAG ingestion system.


Table of Contents

  • Overview
  • chunking_json
  • embedding_json
  • Complete Examples
  • Best Practices
  • Common Pitfalls
  • Troubleshooting
  • Verification Queries
  • Quick Reference

Overview

The rag_sources table stores configuration for ingesting data into the RAG index. Two key JSON columns control how documents are processed:

Column           Purpose                                        Required
chunking_json    Controls how documents are split into chunks   Yes
embedding_json   Controls how vector embeddings are generated   No

Both columns accept JSON objects with specific fields that define the behavior of the ingestion pipeline.


chunking_json

The chunking_json column defines how documents are split into smaller pieces (chunks) for indexing. Chunking is important because:

  • Retrieval precision: smaller chunks allow more precise matching
  • Context management: LLMs work better with focused, appropriately sized content
  • Indexing efficiency: FTS and vector search perform better on appropriately sized units

Schema

CREATE TABLE rag_sources (
    -- ...
    chunking_json TEXT NOT NULL,  -- REQUIRED
    -- ...
);

Fields

Field            Type      Default   Description
enabled          boolean   true      Enable/disable chunking. When false, the entire document becomes a single chunk.
unit             string    "chars"   Unit of measurement. Only "chars" is supported in v0.
chunk_size       integer   4000      Maximum size of each chunk, in characters.
overlap          integer   400       Number of characters shared between consecutive chunks.
min_chunk_size   integer   800       Minimum size for the last chunk. If smaller, it merges into the previous chunk.

Validation Rules

Condition               Action
chunk_size <= 0         Reset to 4000
overlap < 0             Reset to 0
overlap >= chunk_size   Reset to chunk_size / 4
min_chunk_size < 0      Reset to 0
unit != "chars"         Warning logged; falls back to "chars"

Chunking Algorithm

The chunker uses a sliding window approach:

Document: "A long document text that needs to be split..."

With chunk_size=20, overlap=5 (step = 15):

Chunk 0: [0-19]   "A long document text"
Chunk 1: [15-34]  " text that needs to "
Chunk 2: [30-45]  "s to be split..."

Algorithm steps (a code sketch follows the list):

  1. If enabled=false, return entire document as single chunk
  2. If document size <= chunk_size, return as single chunk
  3. Calculate step: step = chunk_size - overlap
  4. Slide window across document by step characters
  5. For final chunk: if size < min_chunk_size, append to previous chunk
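
A compact sketch of this algorithm (illustrative only, not the actual ProxySQL implementation; it assumes the validated invariant overlap < chunk_size):

#include <algorithm>
#include <string>
#include <vector>

// Sliding-window chunker following steps 1-5 above (illustrative sketch).
std::vector<std::string> chunk_document(const std::string &doc,
                                        size_t chunk_size, size_t overlap,
                                        size_t min_chunk_size, bool enabled) {
    if (!enabled || doc.size() <= chunk_size)      // steps 1-2: single chunk
        return {doc};
    const size_t step = chunk_size - overlap;      // step 3
    std::vector<std::string> chunks;
    for (size_t pos = 0; pos < doc.size(); pos += step) {
        size_t len = std::min(chunk_size, doc.size() - pos);
        if (len < min_chunk_size && !chunks.empty()) {
            // step 5: trailing piece too small -> extend the previous chunk
            chunks.back() = doc.substr(pos - step, doc.size() - (pos - step));
            break;
        }
        chunks.push_back(doc.substr(pos, len));    // step 4
        if (pos + len >= doc.size()) break;        // reached the end
    }
    return chunks;
}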

Examples

Example 1: Disable Chunking

{
  "enabled": false
}

Use case: Small documents (posts, comments) that don't need splitting.

Example 2: Default Configuration

{
  "enabled": true,
  "unit": "chars",
  "chunk_size": 4000,
  "overlap": 400,
  "min_chunk_size": 800
}

Use case: General-purpose content like articles, documentation.

Example 3: Smaller Chunks for Code

{
  "enabled": true,
  "unit": "chars",
  "chunk_size": 1500,
  "overlap": 200,
  "min_chunk_size": 500
}

Use case: Code or technical content where smaller, more focused chunks improve retrieval.

Example 4: Large Chunks for Long-form Content

{
  "enabled": true,
  "unit": "chars",
  "chunk_size": 8000,
  "overlap": 800,
  "min_chunk_size": 2000
}

Use case: Books, long reports where maintaining more context per chunk is beneficial.

Visual Example

For a 10,000 character document with chunk_size=4000, overlap=400, min_chunk_size=800:

Chunk 0: chars 0-3999     (4000 chars)
Chunk 1: chars 3600-7599  (4000 chars, overlaps by 400)
Chunk 2: chars 7200-9999  (2800 chars - kept since > min_chunk_size)

Result: 3 chunks

With a 7,500 character document:

Chunk 0: chars 0-3999     (4000 chars)
Chunk 1: chars 3600-7499  (3900 chars - final window merged in)

Result: 2 chunks (the final window, chars 7200-7499, is only 300 chars, below min_chunk_size, so it merges into Chunk 1)
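
As a sanity check, the chunker sketch from the Chunking Algorithm section reproduces both traces:

// Reproduces the two visual examples above, using chunk_document()
// from the earlier sketch.
int main() {
    std::string a(10000, 'x'), b(7500, 'x');
    auto ca = chunk_document(a, 4000, 400, 800, true);
    auto cb = chunk_document(b, 4000, 400, 800, true);
    // ca: 3 chunks of 4000, 4000, 2800 characters
    // cb: 2 chunks of 4000, 3900 characters
    return 0;
}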


embedding_json

The embedding_json column defines how vector embeddings are generated for semantic search. Embeddings convert text into numerical vectors that capture semantic meaning.

Schema

CREATE TABLE rag_sources (
    -- ...
    embedding_json TEXT,  -- OPTIONAL (can be NULL)
    -- ...
);

Fields

Field        Type      Default       Description
enabled      boolean   false         Enable/disable embedding generation.
dim          integer   1536          Vector dimension (must match the model output).
model        string    "unknown"     Model name/identifier (for observability).
provider     string    "stub"        Embedding service: "openai" or "stub".
api_base     string    (empty)       API endpoint URL.
api_key      string    (empty)       API authentication key.
batch_size   integer   16            Number of chunks processed per API request.
timeout_ms   integer   20000         HTTP request timeout in milliseconds.
input        object    (see below)   Specifies how to build the embedding input text.

Validation Rules

Condition         Action
dim <= 0          Reset to 1536
batch_size <= 0   Reset to 16
timeout_ms <= 0   Reset to 20000

Provider Types

1. stub Provider

Generates deterministic pseudo-embeddings by hashing input text. Used for testing without API calls.

{
  "enabled": true,
  "provider": "stub",
  "dim": 1536
}

Benefits:

  • No network dependency
  • No API costs
  • Fast execution
  • Deterministic output

Use case: Development, testing, CI/CD pipelines.
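
The guide does not specify the exact hashing scheme, but the idea can be sketched as follows (hypothetical, for illustration only):

#include <cstdint>
#include <string>
#include <vector>

// Illustrative deterministic pseudo-embedding: hash the text, then expand
// the hash into `dim` floats. The same input always yields the same vector.
std::vector<float> stub_embedding(const std::string &text, size_t dim) {
    uint64_t h = 1469598103934665603ULL;                  // FNV-1a offset basis
    for (unsigned char c : text) { h ^= c; h *= 1099511628211ULL; }
    std::vector<float> vec(dim);
    for (size_t i = 0; i < dim; ++i) {
        h ^= h >> 33; h *= 0xff51afd7ed558ccdULL; h ^= h >> 33;   // mix step
        vec[i] = float(int64_t(h % 2001) - 1000) / 1000.0f;       // map to [-1, 1]
    }
    return vec;
}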

2. openai Provider

Connects to OpenAI or OpenAI-compatible APIs (e.g., Azure OpenAI, local LLM servers).

{
  "enabled": true,
  "provider": "openai",
  "api_base": "https://api.openai.com/v1",
  "api_key": "sk-your-api-key",
  "model": "text-embedding-3-small",
  "dim": 1536,
  "batch_size": 16,
  "timeout_ms": 30000
}

Benefits:

  • High-quality semantic embeddings
  • Batch processing support
  • Wide model compatibility

Use case: Production semantic search.

The input Field

The input field defines what text is sent to the embedding model. It uses a concat specification to combine:

  • Column values from source row: {"col": "ColumnName"}
  • Literal strings: {"lit": "text"}
  • Chunk body: {"chunk_body": true}

Default Behavior

If input is not specified, only the chunk body is embedded.

Custom Input Example

{
  "enabled": true,
  "provider": "openai",
  "dim": 1536,
  "input": {
    "concat": [
      {"col": "Title"},
      {"lit": "\nTags: "},
      {"col": "Tags"},
      {"lit": "\n\n"},
      {"chunk_body": true}
    ]
  }
}

Result: the text sent to the embedding model is {Title}\nTags: {Tags}\n\n{ChunkBody}

This typically improves semantic recall by including title and tags in the embedding.

Input Builder Algorithm

// Simplified representation
if (input_spec contains "concat") {
    result = ""
    for each part in concat:
        if part has "col":        result += row[part.col]
        if part has "lit":        result += part.lit
        if part has "chunk_body" and its value is true: result += chunk_body
    return result
}
return chunk_body  // fallback: embed the chunk body alone

Batching Behavior

Embeddings are generated in batches to reduce API calls:

With batch_size=16 and 100 chunks:

Without batching: 100 API calls
With batching:    7 API calls (16+16+16+16+16+16+4)

Batch process (a code sketch follows the steps):

  1. Collect up to batch_size chunks
  2. Call embedding API once with all inputs
  3. Store all resulting vectors
  4. Repeat until all chunks processed
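
A minimal sketch of that loop; call_embedding_api() is a hypothetical stand-in for one batched HTTP request to the provider:

#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in for one batched embedding API request.
std::vector<std::vector<float>> call_embedding_api(const std::vector<std::string> &batch);

// Batch loop following steps 1-4 above (illustrative sketch).
std::vector<std::vector<float>> embed_all(const std::vector<std::string> &inputs,
                                          size_t batch_size) {
    std::vector<std::vector<float>> vectors;
    for (size_t i = 0; i < inputs.size(); i += batch_size) {
        size_t n = std::min(batch_size, inputs.size() - i);            // step 1
        std::vector<std::string> batch(inputs.begin() + i,
                                       inputs.begin() + i + n);
        auto result = call_embedding_api(batch);                       // step 2
        vectors.insert(vectors.end(), result.begin(), result.end());   // step 3
    }                                                                  // step 4
    return vectors;
}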

Examples

Example 1: Disabled (FTS Only)

{
  "enabled": false
}

Or leave embedding_json as NULL. Only full-text search will be available.

Example 2: Stub for Testing

{
  "enabled": true,
  "provider": "stub",
  "dim": 1536
}

Example 3: OpenAI with Defaults

{
  "enabled": true,
  "provider": "openai",
  "api_base": "https://api.openai.com/v1",
  "api_key": "sk-your-key",
  "model": "text-embedding-3-small",
  "dim": 1536
}

Example 4: OpenAI with Custom Input

{
  "enabled": true,
  "provider": "openai",
  "api_base": "https://api.openai.com/v1",
  "api_key": "sk-your-key",
  "model": "text-embedding-3-small",
  "dim": 1536,
  "batch_size": 32,
  "timeout_ms": 45000,
  "input": {
    "concat": [
      {"col": "Title"},
      {"lit": "\n"},
      {"chunk_body": true}
    ]
  }
}

Example 5: Local LLM Server

{
  "enabled": true,
  "provider": "openai",
  "api_base": "http://localhost:8080/v1",
  "api_key": "dummy",
  "model": "nomic-embed-text",
  "dim": 768,
  "batch_size": 8,
  "timeout_ms": 60000
}

Complete Examples

Example 1: Basic StackOverflow Posts Source

INSERT INTO rag_sources (
    source_id, name, enabled, backend_type,
    backend_host, backend_port, backend_user, backend_pass, backend_db,
    table_name, pk_column,
    doc_map_json,
    chunking_json,
    embedding_json
) VALUES (
    1, 'stack_posts', 1, 'mysql',
    '127.0.0.1', 3306, 'root', 'root', 'stackdb',
    'posts', 'Id',
    '{
        "doc_id": {"format": "posts:{Id}"},
        "title": {"concat": [{"col": "Title"}]},
        "body": {"concat": [{"col": "Body"}]},
        "metadata": {"pick": ["Id", "Tags", "Score"]}
    }',
    '{
        "enabled": true,
        "unit": "chars",
        "chunk_size": 4000,
        "overlap": 400,
        "min_chunk_size": 800
    }',
    '{
        "enabled": true,
        "provider": "openai",
        "api_base": "https://api.openai.com/v1",
        "api_key": "sk-your-key",
        "model": "text-embedding-3-small",
        "dim": 1536,
        "batch_size": 16,
        "timeout_ms": 30000,
        "input": {
            "concat": [
                {"col": "Title"},
                {"lit": "\nTags: "},
                {"col": "Tags"},
                {"lit": "\n\n"},
                {"chunk_body": true}
            ]
        }
    }'
);

Example 2: Documentation Articles (Small Chunks, Stub Embeddings)

INSERT INTO rag_sources (
    source_id, name, enabled, backend_type,
    backend_host, backend_port, backend_user, backend_pass, backend_db,
    table_name, pk_column,
    doc_map_json,
    chunking_json,
    embedding_json
) VALUES (
    2, 'docs', 1, 'mysql',
    '127.0.0.1', 3306, 'root', 'root', 'docsdb',
    'articles', 'article_id',
    '{
        "doc_id": {"format": "docs:{article_id}"},
        "title": {"concat": [{"col": "title"}]},
        "body": {"concat": [{"col": "content"}]},
        "metadata": {"pick": ["category", "author"]}
    }',
    '{
        "enabled": true,
        "unit": "chars",
        "chunk_size": 1500,
        "overlap": 200,
        "min_chunk_size": 500
    }',
    '{
        "enabled": true,
        "provider": "stub",
        "dim": 1536
    }'
);

Example 3: GitHub Issues (No Chunking, Real Embeddings)

INSERT INTO rag_sources (
    source_id, name, enabled, backend_type,
    backend_host, backend_port, backend_user, backend_pass, backend_db,
    table_name, pk_column,
    doc_map_json,
    chunking_json,
    embedding_json
) VALUES (
    3, 'github_issues', 1, 'mysql',
    '127.0.0.1', 3306, 'root', 'root', 'githubdb',
    'issues', 'id',
    '{
        "doc_id": {"format": "issues:{id}"},
        "title": {"concat": [{"col": "title"}]},
        "body": {"concat": [{"col": "body"}]},
        "metadata": {"pick": ["number", "state", "labels"]}
    }',
    '{
        "enabled": false
    }',
    '{
        "enabled": true,
        "provider": "openai",
        "api_base": "https://api.openai.com/v1",
        "api_key": "sk-your-key",
        "model": "text-embedding-3-small",
        "dim": 1536,
        "input": {
            "concat": [
                {"col": "title"},
                {"lit": "\n\n"},
                {"chunk_body": true}
            ]
        }
    }'
);

Best Practices

Chunking

  1. Match content to chunk size:

    • Short posts/comments: Disable chunking (enabled: false)
    • Articles/docs: 3000-5000 characters
    • Code: 1000-2000 characters
    • Books: 6000-10000 characters
  2. Set overlap to 10-20% of chunk size:

    • Provides context continuity
    • Helps avoid cutting important information
  3. Set min_chunk_size to 20-25% of chunk size:

    • Prevents tiny trailing chunks
    • Reduces noise in search results
  4. Consider your embedding model's token limit:

    • OpenAI embedding models: ~8191 tokens per input
    • A token is roughly 4 English characters, so keep chunk_size well under that budget when embeddings are enabled

Embeddings

  1. Use stub provider for development:

    • Faster iteration
    • No API costs
    • Deterministic output
  2. Optimize batch_size for your API:

    • OpenAI: up to 2048 inputs per request (endpoint-dependent)
    • Local servers: typically smaller (4-16)
    • Larger batches mean fewer API calls but more memory
  3. Include relevant context in input:

    • Title, tags, category improve semantic quality
    • Don't include numeric metadata (scores, IDs)
    • Keep input focused and clean
  4. Set appropriate timeouts:

    • OpenAI: 20-30 seconds usually sufficient
    • Local servers: may need 60+ seconds
    • Consider retries for failed requests
  5. Match dimension to model:

    • text-embedding-3-small: 1536
    • text-embedding-3-large: 3072
    • nomic-embed-text: 768
    • Custom models: check documentation

Common Pitfalls

  1. Chunks too large: reduces retrieval precision
  2. Chunks too small: loses context and increases noise
  3. No overlap: misses information at chunk boundaries
  4. Excessive overlap: inflates the index with redundant text
  5. Wrong dimension: causes vector insertion failures
  6. Missing API key: silent failures in some configurations

Troubleshooting

Chunking Issues

Problem: Too many small chunks

Symptoms: High chunk count, many chunks under 500 characters

Solution: Increase min_chunk_size or decrease chunk_size

{
  "chunk_size": 3000,
  "min_chunk_size": 1000
}

Problem: Important context split between chunks

Symptoms: Search misses information that spans chunk boundaries

Solution: Increase overlap

{
  "chunk_size": 4000,
  "overlap": 800
}

Embedding Issues

Problem: "embedding dimension mismatch"

Cause: dim field doesn't match actual model output

Solution: Verify model dimension and update config

{
  "model": "text-embedding-3-small",
  "dim": 1536  // Must match model
}

Problem: Timeout errors

Symptoms: Embedding requests fail after timeout

Solutions:

  1. Increase timeout_ms
  2. Decrease batch_size
  3. Check network connectivity
{
  "timeout_ms": 60000,
  "batch_size": 8
}

Problem: API rate limit errors

Symptoms: HTTP 429 errors from embedding service

Solutions:

  1. Decrease batch_size
  2. Add delay between batches (requires code change)
  3. Upgrade API tier
{
  "batch_size": 4
}

Problem: "embedding api_base is empty"

Cause: Using openai provider without setting api_base

Solution: Set api_base to your endpoint

{
  "provider": "openai",
  "api_base": "https://api.openai.com/v1",
  "api_key": "sk-your-key"
}

Verification Queries

-- Check current configuration
SELECT
    source_id,
    name,
    chunking_json,
    embedding_json
FROM rag_sources
WHERE enabled = 1;

-- Count chunks per document
SELECT
    d.source_id,
    AVG((SELECT COUNT(*) FROM rag_chunks c WHERE c.doc_id = d.doc_id)) as avg_chunks
FROM rag_documents d
GROUP BY d.source_id;

-- Check vector counts
SELECT
    source_id,
    COUNT(*) as vector_count
FROM rag_vec_chunks
GROUP BY source_id;

-- Verify dimensions match
SELECT
    source_id,
    COUNT(*) as count,
    -- This would need a custom query to extract vector length
    'Verify dim in rag_sources.embedding_json matches model' as note
FROM rag_vec_chunks
GROUP BY source_id;

Quick Reference

Minimum Configurations

-- FTS only (no embeddings)
INSERT INTO rag_sources (..., chunking_json, embedding_json) VALUES (
    ...,
    '{"enabled": false}',
    NULL
);

-- Chunking + stub embeddings (testing)
INSERT INTO rag_sources (..., chunking_json, embedding_json) VALUES (
    ...,
    '{"enabled":true,"chunk_size":4000,"overlap":400}',
    '{"enabled":true,"provider":"stub","dim":1536}'
);

-- Chunking + OpenAI embeddings (production)
INSERT INTO rag_sources (..., chunking_json, embedding_json) VALUES (
    ...,
    '{"enabled":true,"chunk_size":4000,"overlap":400}',
    '{"enabled":true,"provider":"openai","api_base":"https://api.openai.com/v1","api_key":"sk-...","model":"text-embedding-3-small","dim":1536}'
);

Common Model Dimensions

Model                    Dimension
text-embedding-3-small   1536
text-embedding-3-large   3072
text-embedding-ada-002   1536
nomic-embed-text         768
all-MiniLM-L6-v2         384