RAG Chunkers

Document chunking strategies and implementations for RAG systems.

Overview

The RAG chunkers system provides several document chunking strategies, each suited to different content types and retrieval scenarios.

Core Features

  • Multiple Chunking Strategies: Text, semantic, and hybrid chunking
  • Content-Aware Chunking: Adapt to different document types
  • Overlap Management: Configurable chunk overlap
  • Metadata Preservation: Maintain document metadata
  • Performance Optimization: Efficient chunking algorithms
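To make the overlap and metadata features above concrete, here is a minimal sketch of fixed-size chunking with a sliding window. It is an illustration only and not recoagent's implementation: `SketchChunk` and `sliding_window_chunks` are hypothetical names, and measuring `chunk_size` and `chunk_overlap` in characters is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class SketchChunk:
    """Hypothetical chunk record for this sketch (not recoagent's Chunk)."""
    text: str
    start: int  # character offset of the chunk start in the source text
    end: int    # character offset one past the chunk end
    metadata: Dict[str, Any] = field(default_factory=dict)


def sliding_window_chunks(
    text: str,
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
    metadata: Optional[Dict[str, Any]] = None,
) -> List[SketchChunk]:
    """Split text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters, and every chunk
    receives its own copy of the document metadata.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: List[SketchChunk] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(SketchChunk(text[start:end], start, end, dict(metadata or {})))
        if end == len(text):
            break  # the last window reached the end of the text
        start += step
    return chunks
```

With a 2,500-character text, `chunk_size=1000`, and `chunk_overlap=200`, the windows start at offsets 0, 800, and 1600, so each chunk repeats the last 200 characters of the previous one.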

Usage Examples

Basic Text Chunking

from recoagent.rag.chunkers import TextChunker

# Create a text chunker with a fixed chunk size and overlap
chunker = TextChunker(
    chunk_size=1000,
    chunk_overlap=200,
)

# Chunk the document; the metadata is passed along with the text
chunks = chunker.chunk_document(
    text="Long document text...",
    metadata={"source": "document.pdf", "page": 1},
)

Advanced Chunking

from recoagent.rag.chunkers import SemanticChunker

# Create a semantic chunker backed by an embedding model
semantic_chunker = SemanticChunker(
    embedding_model="text-embedding-ada-002",
    similarity_threshold=0.8,
)

# Chunk with semantic awareness
semantic_chunks = semantic_chunker.chunk_document(
    text="Document with semantic structure...",
    preserve_semantics=True,
)
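One common way to use a similarity threshold is to embed consecutive sentences and start a new chunk wherever adjacent sentences fall below the threshold. The sketch below shows that idea under stated assumptions and is not a description of how SemanticChunker works internally: `semantic_split`, `toy_embed`, and `cosine` are hypothetical helpers, and the bag-of-words embedding stands in for a real embedding model so the example runs without network access.

```python
import math
import re
from collections import Counter
from typing import Callable, List


def toy_embed(sentence: str) -> Counter:
    """Stand-in embedding: lowercase word counts (assumption, not a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def semantic_split(
    text: str,
    similarity_threshold: float = 0.8,
    embed: Callable[[str], Counter] = toy_embed,
) -> List[str]:
    """Group adjacent sentences while their similarity stays at or above the threshold."""
    # Naive sentence split on ., !, or ? followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return []

    groups = [[sentences[0]]]
    previous = embed(sentences[0])
    for sentence in sentences[1:]:
        current = embed(sentence)
        if cosine(previous, current) >= similarity_threshold:
            groups[-1].append(sentence)  # same topic: extend the current chunk
        else:
            groups.append([sentence])    # similarity dropped: start a new chunk
        previous = current
    return [" ".join(group) for group in groups]
```

A higher threshold produces more, smaller chunks because fewer adjacent sentences count as similar. With a real embedding model, the `embed` argument would be replaced by a function that returns dense vectors, and `cosine` would be adapted accordingly.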

API Reference

TextChunker Methods

chunk_document(text: str, metadata: Dict = None) -> List[Chunk]

Splits a document into text chunks.

Parameters:

  • text (str): Document text to split
  • metadata (Dict, optional): Document metadata. Defaults to None.

Returns: List of Chunk objects

SemanticChunker Methods

chunk_document(text: str, preserve_semantics: bool = True) -> List[Chunk]

Splits a document into chunks with semantic awareness.

Parameters:

  • text (str): Document text to split
  • preserve_semantics (bool): Whether to preserve semantic boundaries. Defaults to True.

Returns: List of semantic Chunk objects

See Also