Architecture
System Design for Document Search & Summarization
Overview
The Document Search & Summarization system is built on a modular, composable architecture that enables flexible configuration through profiles while maintaining high performance and reliability.
Core Components
┌─────────────────────────────────────┐
│ Document Search Pipeline │
│ │
│ ┌──────────────────────────────┐ │
│ │ Profile Configuration │ │
│ │ (Balanced | Latency | │ │
│ │ Quality) │ │
│ └──────────┬───────────────────┘ │
│ ↓ │
│ ┌──────────────────────────────┐ │
│ │ Hybrid Retriever │ │
│ │ • BM25 + Vector Search │ │
│ │ • RRF Fusion │ │
│ └──────────┬───────────────────┘ │
│ ↓ │
│ ┌──────────────────────────────┐ │
│ │ Cross-Encoder Reranker │ │
│ │ (Optional) │ │
│ └──────────┬───────────────────┘ │
│ ↓ │
│ ┌──────────────────────────────┐ │
│ │ Grounded Summarizer │ │
│ │ • Extractive (TextRank) │ │
│ │ • Abstractive (LLM) │ │
│ │ • Citations │ │
│ └──────────────────────────────┘ │
└─────────────────────────────────────┘
Architectural Layers
Layer 1: Storage & Indexing
┌──────────────────────────────────────┐
│ Vector Stores │
│ ┌──────────┐ ┌──────────┐ │
│ │OpenSearch│ │ MongoDB │ │
│ │ (BM25+ │ │ (Vector) │ │
│ │ kNN) │ │ │ │
│ └──────────┘ └──────────┘ │
└──────────────────────────────────────┘
↑
VectorStore interface
Storage Strategy:
- Vector Stores: OpenSearch, MongoDB Atlas, Azure AI Search
- S3: Raw document storage with pre-signed URLs
- Two-tier pattern: S3 for originals, vector stores for search
Layer 2: Retrieval & Ranking
┌──────────────────────────────────────┐
│ Retrieval Pipeline │
│ │
│ ┌──────────┐ ┌────────────┐ │
│ │ BM25 │ │ Vector │ │
│ │Retriever │ │ Retriever │ │
│ └────┬─────┘ └─────┬──────┘ │
│ └────────┬───────┘ │
│ RRF Fusion │
│ ↓ │
│ HybridRetriever │
└──────────────────┬───────────────────┘
↓
┌───────────────┐
│ Reranker │
│ (Cross-Enc) │
└───────────────┘
Key Features:
- BM25Retriever: Keyword search with tunable k1, b parameters
- VectorRetriever: Semantic similarity search
- HybridRetriever: Reciprocal Rank Fusion (RRF) with α-weighting
- CrossEncoderReranker: Deep reranking with circuit breaker protection
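The RRF fusion step can be illustrated with a small sketch. This is a simplified stand-in for the actual HybridRetriever (function and variable names here are illustrative): each document scores `weight / (k + rank)` per list, with `alpha` weighting the vector list against the BM25 list.

```python
def rrf_fuse(bm25_ranked, vector_ranked, alpha=0.5, k=60):
    """Reciprocal Rank Fusion: score = weight / (k + rank).

    `alpha` weights the vector list; (1 - alpha) weights BM25.
    `k` dampens the influence of top ranks (60 is a common default).
    """
    scores = {}
    for rank, doc_id in enumerate(bm25_ranked, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + (1 - alpha) / (k + rank)
    for rank, doc_id in enumerate(vector_ranked, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + alpha / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Documents that rank high in both lists win:
fused = rrf_fuse(["d1", "d2", "d3"], ["d2", "d4", "d1"], alpha=0.5)
# → ["d2", "d1", "d4", "d3"]
```

Rank-based fusion is what lets BM25 scores and cosine similarities combine without any score normalization, which is why RRF is a robust default here.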
Layer 3: Summarization & Orchestration
GroundedSummarizer provides two modes:
Extractive (TextRank):
- Fast (50-200ms)
- No LLM cost
- 100% faithful to source
- Sentence-level citations
Abstractive (LLM):
- Fluent synthesis
- 2-5s latency
- Includes faithfulness verification
- Hallucination detection
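The extractive mode can be sketched in a few lines. The real GroundedSummarizer uses sumy's TextRank; this simplified version scores each sentence by its total word overlap with every other sentence (the degree-centrality approximation of the full PageRank iteration) and keeps sentence indices for citations.

```python
import re

def extractive_summary(text, n=2):
    """Pick the n most central sentences (simplified TextRank-style scoring).

    Returns (index, sentence) pairs in document order, so each selected
    sentence can be cited back to its position in the source.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = [set(re.findall(r"\w+", s.lower())) for s in sentences]
    scores = [
        sum(len(w & other) for j, other in enumerate(words) if j != i)
        for i, w in enumerate(words)
    ]
    top = sorted(sorted(range(len(sentences)), key=lambda i: -scores[i])[:n])
    return [(i, sentences[i]) for i in top]
```

Because the output is verbatim source sentences, faithfulness is guaranteed by construction; the abstractive mode needs a separate verification pass precisely because it loses that property.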
Data Flow
End-to-End Document Journey
1. Document Upload
├── Parse PDF/DOCX/XLSX
└── Extract metadata
2. Storage
├── Raw file → S3
└── Generate pre-signed URL
3. Chunking
├── Semantic chunking
└── Create overlapping chunks
4. Embedding
├── Generate vectors
└── Batch processing
5. Indexing
├── Store in vector database
└── Index for BM25
6. Search
├── Hybrid retrieval (BM25 + Vector)
├── RRF fusion
└── Return top-K results
7. Reranking (Optional)
├── Cross-encoder scoring
└── Improved ranking
8. Summarization
├── Extract/generate summary
├── Add citations
├── Verify faithfulness
└── Return grounded summary
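Step 3 (chunking) is the least obvious stage in the journey above, so here is a sketch of the overlap idea. The real pipeline uses a SemanticChunker that respects sentence boundaries; this character-window version (names illustrative) only shows why chunks overlap: each chunk repeats the tail of the previous one so no statement loses its context at a boundary.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into fixed-size, overlapping chunks.

    The window advances by (chunk_size - overlap), so consecutive
    chunks share `overlap` characters of context.
    """
    step = chunk_size - overlap
    chunks = []
    for start in range(0, max(len(text) - overlap, 1), step):
        chunks.append(text[start:start + chunk_size])
    return chunks

chunks = chunk_text("".join(chr(65 + i % 26) for i in range(500)),
                    chunk_size=200, overlap=50)
# → 3 chunks of 200 chars; chunk boundaries share 50 chars
```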
Profile-Based Configuration
The Strategy Pattern
Different use cases require different performance/quality tradeoffs. Profiles encapsulate these strategies:
# Balanced Profile (General Purpose)
pipeline = DocumentSearchPipeline.create_profile(
    ProfileType.BALANCED,
    store, chunks
)
# Config: alpha=0.5, topK=20, light reranking, extractive summary

# Latency-First Profile (Interactive)
pipeline = DocumentSearchPipeline.create_profile(
    ProfileType.LATENCY_FIRST,
    store, chunks
)
# Config: alpha=0.7, topK=10, no reranking, extractive summary

# Quality-First Profile (Research)
pipeline = DocumentSearchPipeline.create_profile(
    ProfileType.QUALITY_FIRST,
    store, chunks
)
# Config: alpha=0.5, topK=50, deep reranking, abstractive summary
Profile Comparison
| Profile | Latency | Quality | Cost | Use Case |
|---|---|---|---|---|
| Balanced | 500ms | Good (0.7-0.8) | $0.60/1K | General Q&A |
| Latency-First | 250ms | Acceptable (0.7) | $0.35/1K | Interactive |
| Quality-First | 5s | Excellent (0.85-0.95) | $52/1K | Research |
Design Patterns
1. Dependency Injection
Components are injected rather than hard-coded:
class DocumentSearchPipeline:
    def __init__(self, retriever, summarizer, config, reranker=None):
        self.retriever = retriever    # Injected
        self.reranker = reranker      # Optional
        self.summarizer = summarizer  # Injected
        self.config = config          # Injected
Benefits: Easy testing, flexible configuration, loose coupling
2. Factory Method
Hide complex initialization:
@classmethod
def create_profile(cls, profile: ProfileType, vector_store, chunks):
    """Factory method for profile-based initialization."""
    if profile == ProfileType.BALANCED:
        return cls._create_balanced_profile(vector_store, chunks)
    elif profile == ProfileType.LATENCY_FIRST:
        return cls._create_latency_first_profile(vector_store, chunks)
    elif profile == ProfileType.QUALITY_FIRST:
        return cls._create_quality_first_profile(vector_store, chunks)
3. Pipeline Pattern
Break complex processes into discrete stages:
Input: Document File
↓
┌─────────────────────┐
│ Stage 1: Load │
│ Parse document │
└──────────┬──────────┘
↓
┌─────────────────────┐
│ Stage 2: Store │
│ Upload to S3 │
└──────────┬──────────┘
↓
┌─────────────────────┐
│ Stage 3: Chunk │
│ Split into pieces │
└──────────┬──────────┘
↓
┌─────────────────────┐
│ Stage 4: Embed │
│ Generate vectors │
└──────────┬──────────┘
↓
┌─────────────────────┐
│ Stage 5: Index │
│ Store in database │
└──────────┬──────────┘
↓
Output: IndexingResult
Benefits: Independently testable stages, progress tracking, error isolation
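The pipeline pattern above can be sketched as a list of named stage functions run in order. This is an illustrative sketch, not the actual DocumentIndexingPipeline: the completed-stage list is what gives progress tracking and error isolation (on failure, the last completed stage is known).

```python
def run_pipeline(document, stages):
    """Run named stages in order, recording progress as each completes.

    Each stage is a plain function, so every stage is independently
    testable with a stubbed input.
    """
    completed = []
    for name, stage in stages:
        document = stage(document)
        completed.append(name)
    return document, completed

result, done = run_pipeline(
    "raw",
    [
        ("load", lambda d: d + ":loaded"),
        ("chunk", lambda d: d + ":chunked"),
        ("index", lambda d: d + ":indexed"),
    ],
)
# → result == "raw:loaded:chunked:indexed", done == ["load", "chunk", "index"]
```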
Storage Architecture
Two-Storage Pattern
Different data types have different storage requirements:
Document Storage:
├── Raw Files (S3)
│ ├── Large files (100MB+)
│ ├── Immutable originals
│ └── Pre-signed URLs
│
├── Searchable Content (Vector Store)
│ ├── Chunked and embedded
│ ├── BM25 + kNN indexes
│ └── Fast retrieval
│
└── Metadata (MongoDB/PostgreSQL)
├── Document info
├── Processing status
└── Relationships
Why Two Storage Systems?
| Need | S3 | Vector Store | Reason |
|---|---|---|---|
| Store 100MB PDF | ✅ | ❌ | Vector stores optimized for small chunks |
| Full-text search | ❌ | ✅ | S3 is object storage, not searchable |
| Keep original | ✅ | ❌ | Vector stores only keep chunks |
| Fast retrieval | ❌ | ✅ | Vector stores optimized for search |
| Pre-signed URLs | ✅ | ❌ | S3 feature for direct download |
S3 Pre-Signed URLs
Temporary, secure URLs for direct access:
# Generate URL valid for 1 hour
url = s3_storage.generate_presigned_url(
    s3_key="documents/2025/10/doc123/report.pdf",
    expiration=3600
)
Security Benefits:
- Time-limited (expires after N seconds)
- No credentials exposed
- Can't access other objects
- Auditable via CloudTrail
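The real implementation delegates signing to S3, but the mechanism behind a pre-signed URL can be sketched with stdlib HMAC: the URL carries an expiry timestamp and a signature over (key, expiry), so the server can verify it later without ever exposing the signing secret. All names and the URL scheme here are illustrative, not S3's actual signature format.

```python
import hmac, hashlib, time

SECRET = b"server-side-signing-key"  # never leaves the server

def presign(s3_key, expiration=3600, now=None):
    """Build a time-limited signed URL (conceptual sketch of pre-signing)."""
    expires = int(now if now is not None else time.time()) + expiration
    payload = f"{s3_key}:{expires}".encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return f"https://bucket.example/{s3_key}?expires={expires}&sig={sig}"

def verify(url, now=None):
    """Reject tampered or expired URLs: this is why the key path is
    bound into the signature (can't be swapped for another object)."""
    path, query = url.split("?")
    s3_key = path.removeprefix("https://bucket.example/")
    params = dict(p.split("=") for p in query.split("&"))
    payload = f"{s3_key}:{params['expires']}".encode()
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    fresh = int(params["expires"]) > (now if now is not None else time.time())
    return fresh and hmac.compare_digest(expected, params["sig"])

url = presign("documents/2025/10/doc123/report.pdf", expiration=3600, now=1000)
```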
Scalability Architecture
Horizontal Scaling
Stateless design allows multiple instances:
Load Balancer
↓
┌──────────┼──────────┐
↓ ↓ ↓
Pipeline Pipeline Pipeline
Instance 1 Instance 2 Instance 3
↓ ↓ ↓
└──────────┼──────────┘
↓
Shared Storage
┌──────────┼──────────┐
↓ ↓ ↓
OpenSearch MongoDB S3
Scaling Characteristics:
- No state in pipeline instances
- All state in shared stores
- Add instances = more throughput
- Each instance handles 100+ QPS
Multi-Level Caching
Query comes in
↓
L1: Redis - Full result cache
Cache hit? → Return (5ms)
Cache miss? ↓
L2: Redis - Retrieval cache
Cache hit? → Skip to summary (100ms)
Cache miss? ↓
L3: Execute full pipeline (500ms)
↓
Cache results in L1 & L2
Performance Impact:
- 40-60% cache hit rate
- 95% latency reduction on cache hits
- 50-70% cost reduction
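The L1/L2 lookup order can be sketched with plain dicts standing in for Redis; `retrieve` and `summarize` represent the expensive pipeline stages from the diagram above (all names illustrative).

```python
l1_results = {}    # full-result cache (Redis in production)
l2_retrieval = {}  # retrieval-only cache (Redis in production)

def search(query, retrieve, summarize):
    """Look up L1, then L2, then fall through to the full pipeline."""
    if query in l1_results:            # L1 hit: fast path, nothing recomputed
        return l1_results[query]
    if query in l2_retrieval:          # L2 hit: skip retrieval, re-summarize
        docs = l2_retrieval[query]
    else:                              # full miss: run the whole pipeline
        docs = retrieve(query)
        l2_retrieval[query] = docs
    answer = summarize(docs)
    l1_results[query] = answer         # populate both levels for next time
    return answer

calls = []
answer = search("q", lambda q: calls.append("retrieve") or ["doc"], lambda d: "summary")
again = search("q", lambda q: calls.append("retrieve") or ["doc"], lambda d: "summary")
# → second call is an L1 hit; retrieval ran exactly once
```

Keeping L2 separate from L1 pays off when summarization parameters change but the query does not: the retrieval work survives an L1 invalidation.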
Error Handling
Fail Gracefully Strategy
Preserve progress through pipeline stages:
def index_document(file_path):
    result = IndexingResult(status='started')
    try:
        # Stage 1: Load
        doc = load_document(file_path)
        result.doc_loaded = True
    except Exception as e:
        result.status = 'failed'
        result.error = f"Loading failed: {e}"
        return result  # Exit early

    try:
        # Stage 2: S3 (non-fatal)
        s3_key = upload_to_s3(doc)
        result.s3_uploaded = True
    except Exception as e:
        log_warning(f"S3 upload failed: {e}")
        # Continue without S3

    # ... continue through stages
Key Principles:
- Some failures are fatal (stop processing)
- Some failures are warnings (continue)
- Always preserve progress state
- Return partial results when possible
Design Tradeoffs
Performance vs Quality
Solved through profiles:
| Profile | Performance | Quality | Cost |
|---|---|---|---|
| Latency-First | ⭐⭐⭐ | ⭐⭐ | ⭐ |
| Balanced | ⭐⭐ | ⭐⭐ | ⭐⭐ |
| Quality-First | ⭐ | ⭐⭐⭐ | ⭐⭐⭐ |
Extractive vs Abstractive
Extractive (Default):
- ✅ Fast (50-200ms)
- ✅ No LLM cost
- ✅ 100% faithful
- ❌ Less fluent
Abstractive (Opt-in):
- ✅ Fluent synthesis
- ✅ Better readability
- ❌ Slow (2-5s)
- ❌ Expensive ($0.01-0.10)
- ❌ Hallucination risk
Simplicity vs Flexibility
Simple (90% use case):
pipeline = Pipeline.create_profile(ProfileType.BALANCED, ...)
Flexible (10% use case):
custom_config = PipelineConfig(
    topK=30,
    alpha=0.6,
    enable_reranking=True,
    latency_budget_ms=1000
)
pipeline = Pipeline(..., config=custom_config)
Extension Points
Adding a New Profile
@classmethod
def _create_custom_profile(cls, store, chunks):
    """Add your custom profile."""
    retriever = HybridRetriever(
        ...,
        alpha=0.6,
        vector_k=30
    )
    reranker = CrossEncoderReranker(
        model_name="your-model"
    )
    summarizer = GroundedSummarizer(
        mode=SummarizationMode.EXTRACTIVE,
        faithfulness_threshold=0.90
    )
    config = PipelineConfig(
        profile=ProfileType.CUSTOM,
        topK=30,
        latency_budget_ms=1000,
        ...
    )
    return cls(retriever, summarizer, config, reranker)
Adding a New Vector Store
from packages.rag.stores import VectorStore

class YourCustomStore(VectorStore):
    def add_documents(self, documents): ...
    def search(self, query_embedding, k): ...
    def delete_documents(self, ids): ...
    def get_stats(self): ...

# Use it
pipeline = Pipeline.create_profile(
    profile=ProfileType.BALANCED,
    vector_store=YourCustomStore(...),
    chunks=chunks
)
Component Diagram
DocumentSearchPipeline
├── HybridRetriever
│ ├── BM25Retriever
│ └── VectorRetriever
│ └── VectorStore
│ ├── OpenSearchStore
│ ├── MongoDBStore
│ └── AzureAISearchStore
│
├── CrossEncoderReranker (optional)
│
└── GroundedSummarizer
├── Extractive (sumy/TextRank)
└── Abstractive (LangChain + OpenAI)
DocumentIndexingPipeline
├── DocumentLoader
├── SemanticChunker
├── VectorStore
└── S3DocumentStorage
Summary
Architectural Principles
- Modularity: Compose small, focused components
- Configuration: Profiles over code variations
- Strategy Pattern: Encapsulate profile configurations
- Dependency Injection: Loose coupling, easy testing
- Fail Gracefully: Preserve progress, handle errors
- Horizontal Scaling: Stateless design
Key Features
- Profile-based configuration (Balanced, Latency-First, Quality-First)
- Hybrid retrieval (BM25 + Vector with RRF)
- Optional cross-encoder reranking
- Grounded summarization with citations
- Two-tier storage (S3 + Vector Store)
- Multi-level caching
- Horizontal scalability
Next: Storage & Indexing | API Integration