Architecture
System Design for Document Search & Summarization
Overview
The Document Search & Summarization system is built on a modular, composable architecture that enables flexible configuration through profiles while maintaining high performance and reliability.
Core Components

┌────────────────────────────────┐
│    Document Search Pipeline    │
│                                │
│  ┌──────────────────────────┐  │
│  │ Profile Configuration    │  │
│  │ (Balanced | Latency |    │  │
│  │  Quality)                │  │
│  └────────────┬─────────────┘  │
│               ▼                │
│  ┌──────────────────────────┐  │
│  │ Hybrid Retriever         │  │
│  │ • BM25 + Vector Search   │  │
│  │ • RRF Fusion             │  │
│  └────────────┬─────────────┘  │
│               ▼                │
│  ┌──────────────────────────┐  │
│  │ Cross-Encoder Reranker   │  │
│  │ (Optional)               │  │
│  └────────────┬─────────────┘  │
│               ▼                │
│  ┌──────────────────────────┐  │
│  │ Grounded Summarizer      │  │
│  │ • Extractive (TextRank)  │  │
│  │ • Abstractive (LLM)      │  │
│  │ • Citations              │  │
│  └──────────────────────────┘  │
└────────────────────────────────┘
Architectural Layers

Layer 1: Storage & Indexing

┌────────────────────────────────┐
│         Vector Stores          │
│  ┌──────────┐   ┌──────────┐   │
│  │OpenSearch│   │ MongoDB  │   │
│  │ (BM25 +  │   │ (Vector) │   │
│  │   kNN)   │   │          │   │
│  └──────────┘   └──────────┘   │
└────────────────────────────────┘
               ▲
               │
     VectorStore interface
Storage Strategy:
- Vector Stores: OpenSearch, MongoDB Atlas, Azure AI Search
- S3: Raw document storage with pre-signed URLs
- Two-tier pattern: S3 for originals, vector stores for search
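The two-tier write path can be sketched roughly as below. The `ingest` helper, its parameters, and the dict/list stand-ins for S3 and the vector store are illustrative placeholders, not the project's actual API:

```python
def ingest(doc_bytes, doc_id, s3, vector_store, chunk, embed):
    """Write the original to object storage, then chunk + embed for search.

    `s3` is a dict standing in for an S3 bucket and `vector_store` a list
    standing in for a search index; `chunk` and `embed` are caller-supplied
    callables. All names here are illustrative, not the real interfaces.
    """
    s3_key = f"documents/{doc_id}/original"
    s3[s3_key] = doc_bytes  # tier 1: immutable original
    records = [
        {"id": f"{doc_id}-{i}", "text": c, "vector": embed(c), "s3_key": s3_key}
        for i, c in enumerate(chunk(doc_bytes.decode()))
    ]
    vector_store.extend(records)  # tier 2: searchable chunks
    return s3_key, len(records)
```

Each indexed chunk carries its `s3_key` back-reference, so search results can always link to the pre-signed original.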
Layer 2: Retrieval & Ranking

┌────────────────────────────────────┐
│         Retrieval Pipeline         │
│                                    │
│  ┌──────────┐      ┌────────────┐  │
│  │  BM25    │      │  Vector    │  │
│  │Retriever │      │ Retriever  │  │
│  └─────┬────┘      └─────┬──────┘  │
│        └────────┬────────┘         │
│                 ▼                  │
│            RRF Fusion              │
│                 │                  │
│                 ▼                  │
│          HybridRetriever           │
└─────────────────┬──────────────────┘
                  ▼
           ┌─────────────┐
           │  Reranker   │
           │ (Cross-Enc) │
           └─────────────┘
Key Features:
- BM25Retriever: Keyword search with tunable k1, b parameters
- VectorRetriever: Semantic similarity search
- HybridRetriever: Reciprocal Rank Fusion (RRF) with α-weighting
- CrossEncoderReranker: Deep reranking with circuit breaker protection
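The fusion step can be sketched as standard α-weighted Reciprocal Rank Fusion. The function below is illustrative, not the project's actual implementation:

```python
def rrf_fuse(bm25_ranked, vector_ranked, alpha=0.5, k=60):
    """Alpha-weighted Reciprocal Rank Fusion over two ranked ID lists.

    alpha weights the vector list, (1 - alpha) the BM25 list;
    k=60 is the conventional RRF smoothing constant.
    """
    scores = {}
    for rank, doc_id in enumerate(bm25_ranked):
        scores[doc_id] = scores.get(doc_id, 0.0) + (1 - alpha) / (k + rank + 1)
    for rank, doc_id in enumerate(vector_ranked):
        scores[doc_id] = scores.get(doc_id, 0.0) + alpha / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# "d2" is near the top of both lists, so it ranks first after fusion.
fused = rrf_fuse(["d1", "d2", "d3"], ["d2", "d4", "d1"])
```

Because RRF works on ranks rather than raw scores, BM25 and cosine similarities never need to be normalized against each other.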
Layer 3: Summarization & Orchestration
GroundedSummarizer provides two modes:
Extractive (TextRank):
- Fast (50-200ms)
- No LLM cost
- 100% faithful to source
- Sentence-level citations
Abstractive (LLM):
- Fluent synthesis
- 2-5s latency
- Includes faithfulness verification
- Hallucination detection
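The extractive mode's core idea, select source sentences and keep a citation to where each came from, can be illustrated with a toy scorer. This uses summed word frequency as a crude stand-in for TextRank's graph centrality, and is not the sumy-based implementation the pipeline actually uses:

```python
import re
from collections import Counter

def extractive_summary(chunks, max_sentences=2):
    """Pick the highest-scoring sentences, keeping a (chunk_index, sentence)
    pair for each so the summary stays citable back to its source chunk."""
    sentences = []
    for idx, chunk in enumerate(chunks):
        for sent in re.split(r"(?<=[.!?])\s+", chunk.strip()):
            if sent:
                sentences.append((idx, sent))
    # Score each sentence by the corpus-wide frequency of its words.
    freq = Counter(w.lower() for _, s in sentences for w in re.findall(r"\w+", s))
    def score(item):
        return sum(freq[w.lower()] for w in re.findall(r"\w+", item[1]))
    return sorted(sentences, key=score, reverse=True)[:max_sentences]

picks = extractive_summary(["The cat sat. The cat ran.", "Dogs bark loudly."])
```

Since every output sentence is copied verbatim from a chunk, faithfulness is guaranteed by construction, which is why this mode needs no verification step.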
Data Flow

End-to-End Document Journey

1. Document Upload
   ├── Parse PDF/DOCX/XLSX
   └── Extract metadata
2. Storage
   ├── Raw file → S3
   └── Generate pre-signed URL
3. Chunking
   ├── Semantic chunking
   └── Create overlapping chunks
4. Embedding
   ├── Generate vectors
   └── Batch processing
5. Indexing
   ├── Store in vector database
   └── Index for BM25
6. Search
   ├── Hybrid retrieval (BM25 + Vector)
   ├── RRF fusion
   └── Return top-K results
7. Reranking (Optional)
   ├── Cross-encoder scoring
   └── Improved ranking
8. Summarization
   ├── Extract/generate summary
   ├── Add citations
   ├── Verify faithfulness
   └── Return grounded summary
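The overlap in step 3 ensures that content cut at one chunk boundary still appears whole in the next chunk. A minimal character-window sketch (the pipeline's SemanticChunker splits on meaning rather than fixed offsets):

```python
def overlapping_chunks(text, size=200, overlap=50):
    """Fixed-size character windows that overlap by `overlap` characters,
    so a sentence split at one boundary is intact in the following chunk."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

Each chunk's last `overlap` characters are repeated as the next chunk's first characters, trading a little index size for retrieval robustness.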
Profile-Based Configuration

The Strategy Pattern
Different use cases require different performance/quality tradeoffs. Profiles encapsulate these strategies:
# Balanced Profile (General Purpose)
pipeline = DocumentSearchPipeline.create_profile(
    ProfileType.BALANCED,
    store, chunks
)
# Config: alpha=0.5, topK=20, light reranking, extractive summary

# Latency-First Profile (Interactive)
pipeline = DocumentSearchPipeline.create_profile(
    ProfileType.LATENCY_FIRST,
    store, chunks
)
# Config: alpha=0.7, topK=10, no reranking, extractive summary

# Quality-First Profile (Research)
pipeline = DocumentSearchPipeline.create_profile(
    ProfileType.QUALITY_FIRST,
    store, chunks
)
# Config: alpha=0.5, topK=50, deep reranking, abstractive summary
Profile Comparison

| Profile | Latency | Quality | Cost | Use Case |
|---|---|---|---|---|
| Balanced | 500ms | Good (0.7-0.8) | $0.60/1K | General Q&A |
| Latency-First | 250ms | Acceptable (0.7) | $0.35/1K | Interactive |
| Quality-First | 5s | Excellent (0.85-0.95) | $52/1K | Research |
Design Patterns

1. Dependency Injection
Components are injected rather than hard-coded:
class DocumentSearchPipeline:
    def __init__(self, retriever, summarizer, config, reranker=None):
        self.retriever = retriever    # Injected
        self.reranker = reranker      # Optional
        self.summarizer = summarizer  # Injected
        self.config = config          # Injected
Benefits: Easy testing, flexible configuration, loose coupling
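The testing benefit is concrete: injected components can be swapped for stubs. The stub classes and the minimal `Pipeline` below are hypothetical test doubles mirroring the constructor shape above, not part of the codebase:

```python
class StubRetriever:
    def retrieve(self, query, k):
        # Fake retrieval: deterministic doc IDs, no index required.
        return [f"doc-{i}" for i in range(k)]

class StubSummarizer:
    def summarize(self, docs):
        return f"summary of {len(docs)} docs"

class Pipeline:
    """Minimal stand-in mirroring the injected-constructor shape above."""
    def __init__(self, retriever, summarizer, config, reranker=None):
        self.retriever, self.summarizer = retriever, summarizer
        self.config, self.reranker = config, reranker

    def run(self, query):
        docs = self.retriever.retrieve(query, self.config["topK"])
        if self.reranker:
            docs = self.reranker.rerank(query, docs)
        return self.summarizer.summarize(docs)

pipeline = Pipeline(StubRetriever(), StubSummarizer(), {"topK": 3})
# pipeline.run("query") -> "summary of 3 docs", with no real index or LLM
```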
2. Factory Method
Hide complex initialization:
@classmethod
def create_profile(cls, profile: ProfileType, vector_store, chunks):
    """Factory method for profile-based initialization."""
    if profile == ProfileType.BALANCED:
        return cls._create_balanced_profile(vector_store, chunks)
    elif profile == ProfileType.LATENCY_FIRST:
        return cls._create_latency_first_profile(vector_store, chunks)
    elif profile == ProfileType.QUALITY_FIRST:
        return cls._create_quality_first_profile(vector_store, chunks)
3. Pipeline Pattern
Break complex processes into discrete stages:
Input: Document File
          │
          ▼
┌─────────────────────┐
│  Stage 1: Load      │
│  Parse document     │
└──────────┬──────────┘
           ▼
┌─────────────────────┐
│  Stage 2: Store     │
│  Upload to S3       │
└──────────┬──────────┘
           ▼
┌─────────────────────┐
│  Stage 3: Chunk     │
│  Split into pieces  │
└──────────┬──────────┘
           ▼
┌─────────────────────┐
│  Stage 4: Embed     │
│  Generate vectors   │
└──────────┬──────────┘
           ▼
┌─────────────────────┐
│  Stage 5: Index     │
│  Store in database  │
└──────────┬──────────┘
           ▼
Output: IndexingResult
Benefits: Independently testable stages, progress tracking, error isolation
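The staged flow with progress tracking and error isolation can be sketched generically; `run_stages` and its result dict are illustrative, not the project's indexing code:

```python
def run_stages(document, stages):
    """Run (name, fn) stages in order, recording progress per stage;
    stop at the first failure so the error is isolated to one stage."""
    result = {"completed": [], "error": None, "value": document}
    for name, fn in stages:
        try:
            result["value"] = fn(result["value"])
            result["completed"].append(name)
        except Exception as exc:
            result["error"] = f"{name} failed: {exc}"
            break
    return result

out = run_stages("  hello world  ", [("load", str.strip), ("chunk", str.split)])
# out["completed"] == ["load", "chunk"]; out["value"] == ["hello", "world"]
```

Each stage is a plain callable, so it can be unit-tested on its own, and the `completed` list shows exactly how far a failed document got.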
Storage Architecture

Two-Storage Pattern
Different data types have different storage requirements:
Document Storage:
├── Raw Files (S3)
│   ├── Large files (100MB+)
│   ├── Immutable originals
│   └── Pre-signed URLs
│
├── Searchable Content (Vector Store)
│   ├── Chunked and embedded
│   ├── BM25 + kNN indexes
│   └── Fast retrieval
│
└── Metadata (MongoDB/PostgreSQL)
    ├── Document info
    ├── Processing status
    └── Relationships
Why Two Storage Systems?

| Need | S3 | Vector Store | Reason |
|---|---|---|---|
| Store 100MB PDF | ✅ | ❌ | Vector stores optimized for small chunks |
| Full-text search | ❌ | ✅ | S3 is object storage, not searchable |
| Keep original | ✅ | ❌ | Vector stores only keep chunks |
| Fast retrieval | ❌ | ✅ | Vector stores optimized for search |
| Pre-signed URLs | ✅ | ❌ | S3 feature for direct download |
S3 Pre-Signed URLs
Temporary, secure URLs for direct access:
# Generate URL valid for 1 hour
url = s3_storage.generate_presigned_url(
    s3_key="documents/2025/10/doc123/report.pdf",
    expiration=3600
)
Security Benefits:
- Time-limited (expires after N seconds)
- No credentials exposed
- Can't access other objects
- Auditable via CloudTrail
Scalability Architecture

Horizontal Scaling
Stateless design allows multiple instances:
            Load Balancer
                 │
      ┌──────────┼──────────┐
      ▼          ▼          ▼
  Pipeline   Pipeline   Pipeline
 Instance 1 Instance 2 Instance 3
      │          │          │
      └──────────┼──────────┘
                 ▼
           Shared Storage
      ┌──────────┼──────────┐
      ▼          ▼          ▼
 OpenSearch   MongoDB      S3
Scaling Characteristics:
- No state in pipeline instances
- All state in shared stores
- Add instances = more throughput
- Each instance handles 100+ QPS
Multi-Level Caching

Query comes in
      │
      ▼
L1: Redis - Full result cache
      Cache hit?  → Return (5ms)
      Cache miss  ▼
L2: Redis - Retrieval cache
      Cache hit?  → Skip to summary (100ms)
      Cache miss  ▼
L3: Execute full pipeline (500ms)
      │
      ▼
Cache results in L1 & L2
Performance Impact:
- 40-60% cache hit rate
- 95% latency reduction on cache hits
- 50-70% cost reduction
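The two cache tiers can be sketched with plain dicts; in the real deployment both tiers live in Redis with TTLs, and the class and method names below are illustrative:

```python
import hashlib

class TwoLevelCache:
    """Dict-backed sketch of the L1 (full result) / L2 (retrieval) tiers."""

    def __init__(self):
        self.l1, self.l2 = {}, {}

    @staticmethod
    def _key(query):
        # Hash the query so arbitrary text maps to a compact cache key.
        return hashlib.sha256(query.encode()).hexdigest()

    def get_result(self, query, retrieve, summarize):
        k = self._key(query)
        if k in self.l1:              # L1 hit: return the full cached result
            return self.l1[k]
        docs = self.l2.get(k)
        if docs is None:              # L2 miss: run retrieval, cache its output
            docs = retrieve(query)
            self.l2[k] = docs
        result = summarize(docs)      # only reached on an L1 miss
        self.l1[k] = result
        return result
```

An L2 hit still pays for summarization but skips retrieval, which matches the 100ms middle tier in the flow above.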
Error Handling

Fail Gracefully Strategy
Preserve progress through pipeline stages:
def index_document(file_path):
    result = IndexingResult(status='started')

    try:
        # Stage 1: Load (fatal on failure)
        doc = load_document(file_path)
        result.doc_loaded = True
    except Exception as e:
        result.status = 'failed'
        result.error = f"Loading failed: {e}"
        return result  # Exit early

    try:
        # Stage 2: S3 upload (non-fatal)
        s3_key = upload_to_s3(doc)
        result.s3_uploaded = True
    except Exception as e:
        log_warning(f"S3 upload failed: {e}")
        # Continue without S3

    # ... continue through stages
Key Principles:
- Some failures are fatal (stop processing)
- Some failures are warnings (continue)
- Always preserve progress state
- Return partial results when possible
Design Tradeoffs

Performance vs Quality
Solved through profiles:
| Profile | Performance | Quality | Cost |
|---|---|---|---|
| Latency-First | ⭐⭐⭐ | ⭐⭐ | ⭐ |
| Balanced | ⭐⭐ | ⭐⭐ | ⭐⭐ |
| Quality-First | ⭐ | ⭐⭐⭐ | ⭐⭐⭐ |
Extractive vs Abstractive

Extractive (Default):
- ✅ Fast (50-200ms)
- ✅ No LLM cost
- ✅ 100% faithful
- ❌ Less fluent

Abstractive (Opt-in):
- ✅ Fluent synthesis
- ✅ Better readability
- ❌ Slow (2-5s)
- ❌ Expensive ($0.01-0.10)
- ❌ Hallucination risk
Simplicity vs Flexibility
Simple (90% use case):
pipeline = Pipeline.create_profile(ProfileType.BALANCED, ...)
Flexible (10% use case):
custom_config = PipelineConfig(
    topK=30,
    alpha=0.6,
    enable_reranking=True,
    latency_budget_ms=1000
)
pipeline = Pipeline(..., config=custom_config)
Extension Points

Adding a New Profile
@classmethod
def _create_custom_profile(cls, store, chunks):
    """Add your custom profile."""
    retriever = HybridRetriever(
        ...,
        alpha=0.6,
        vector_k=30
    )
    reranker = CrossEncoderReranker(
        model_name="your-model"
    )
    summarizer = GroundedSummarizer(
        mode=SummarizationMode.EXTRACTIVE,
        faithfulness_threshold=0.90
    )
    config = PipelineConfig(
        profile=ProfileType.CUSTOM,
        topK=30,
        latency_budget_ms=1000,
        ...
    )
    return cls(retriever, summarizer, config, reranker)
Adding a New Vector Store
from packages.rag.stores import VectorStore
class YourCustomStore(VectorStore):
    def add_documents(self, documents): ...
    def search(self, query_embedding, k): ...
    def delete_documents(self, ids): ...
    def get_stats(self): ...
# Use it
pipeline = Pipeline.create_profile(
    profile=ProfileType.BALANCED,
    vector_store=YourCustomStore(...),
    chunks=chunks
)
Component Diagram

DocumentSearchPipeline
├── HybridRetriever
│   ├── BM25Retriever
│   ├── VectorRetriever
│   └── VectorStore
│       ├── OpenSearchStore
│       ├── MongoDBStore
│       └── AzureAISearchStore
│
├── CrossEncoderReranker (optional)
│
└── GroundedSummarizer
    ├── Extractive (sumy/TextRank)
    └── Abstractive (LangChain + OpenAI)

DocumentIndexingPipeline
├── DocumentLoader
├── SemanticChunker
├── VectorStore
└── S3DocumentStorage
Summary

Architectural Principles
- Modularity: Compose small, focused components
- Configuration: Profiles over code variations
- Strategy Pattern: Encapsulate profile configurations
- Dependency Injection: Loose coupling, easy testing
- Fail Gracefully: Preserve progress, handle errors
- Horizontal Scaling: Stateless design
Key Features
- Profile-based configuration (Balanced, Latency-First, Quality-First)
- Hybrid retrieval (BM25 + Vector with RRF)
- Optional cross-encoder reranking
- Grounded summarization with citations
- Two-tier storage (S3 + Vector Store)
- Multi-level caching
- Horizontal scalability
Next: Storage & Indexing | API Integration