Architecture

System Design for Document Search & Summarization


Overview​

The Document Search & Summarization system is built on a modular, composable architecture that enables flexible configuration through profiles while maintaining high performance and reliability.


Core Components​

┌────────────────────────────────────────┐
│        Document Search Pipeline        │
│                                        │
│   ┌────────────────────────────────┐   │
│   │     Profile Configuration      │   │
│   │     (Balanced | Latency |      │   │
│   │      Quality)                  │   │
│   └───────────────┬────────────────┘   │
│                   ↓                    │
│   ┌────────────────────────────────┐   │
│   │       Hybrid Retriever         │   │
│   │   • BM25 + Vector Search       │   │
│   │   • RRF Fusion                 │   │
│   └───────────────┬────────────────┘   │
│                   ↓                    │
│   ┌────────────────────────────────┐   │
│   │    Cross-Encoder Reranker      │   │
│   │         (Optional)             │   │
│   └───────────────┬────────────────┘   │
│                   ↓                    │
│   ┌────────────────────────────────┐   │
│   │      Grounded Summarizer       │   │
│   │   • Extractive (TextRank)      │   │
│   │   • Abstractive (LLM)          │   │
│   │   • Citations                  │   │
│   └────────────────────────────────┘   │
└────────────────────────────────────────┘

Architectural Layers​

Layer 1: Storage & Indexing​

┌──────────────────────────────────────┐
│            Vector Stores             │
│   ┌──────────┐      ┌──────────┐     │
│   │OpenSearch│      │ MongoDB  │     │
│   │ (BM25 +  │      │ (Vector) │     │
│   │   kNN)   │      │          │     │
│   └──────────┘      └──────────┘     │
└──────────────────────────────────────┘
                  ↑
         VectorStore interface

Storage Strategy:

  • Vector Stores: OpenSearch, MongoDB Atlas, Azure AI Search
  • S3: Raw document storage with pre-signed URLs
  • Two-tier pattern: S3 for originals, vector stores for search

Layer 2: Retrieval & Ranking​

┌──────────────────────────────────────┐
│          Retrieval Pipeline          │
│                                      │
│   ┌──────────┐     ┌────────────┐    │
│   │   BM25   │     │   Vector   │    │
│   │Retriever │     │ Retriever  │    │
│   └────┬─────┘     └─────┬──────┘    │
│        └─────────┬───────┘           │
│             RRF Fusion               │
│                  ↓                   │
│           HybridRetriever            │
└─────────────────┬────────────────────┘
                  ↓
          ┌───────────────┐
          │   Reranker    │
          │  (Cross-Enc)  │
          └───────────────┘

Key Features:

  • BM25Retriever: Keyword search with tunable k1, b parameters
  • VectorRetriever: Semantic similarity search
  • HybridRetriever: Reciprocal Rank Fusion (RRF) with α-weighting
  • CrossEncoderReranker: Deep reranking with circuit breaker protection
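The fusion step can be sketched as a few lines of Python. This is a minimal weighted-RRF implementation, assuming each retriever returns a ranked list of document IDs; the real HybridRetriever API may differ:

```python
def rrf_fuse(bm25_ranked, vector_ranked, alpha=0.5, k=60):
    """Weighted Reciprocal Rank Fusion: alpha weights the vector list,
    (1 - alpha) weights the BM25 list; k dampens the top ranks."""
    scores = {}
    for rank, doc_id in enumerate(bm25_ranked):
        scores[doc_id] = scores.get(doc_id, 0.0) + (1 - alpha) / (k + rank + 1)
    for rank, doc_id in enumerate(vector_ranked):
        scores[doc_id] = scores.get(doc_id, 0.0) + alpha / (k + rank + 1)
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)
```

A document appearing high in both lists accumulates score from both terms, which is why hybrid retrieval outperforms either retriever alone.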

Layer 3: Summarization & Orchestration​

GroundedSummarizer provides two modes:

Extractive (TextRank):

  • Fast (50-200ms)
  • No LLM cost
  • 100% faithful to source
  • Sentence-level citations

Abstractive (LLM):

  • Fluent synthesis
  • 2-5s latency
  • Includes faithfulness verification
  • Hallucination detection
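One naive way to implement the faithfulness check is lexical overlap between each summary sentence and the retrieved sources. The sketch below is an illustrative assumption; the actual verifier may use embedding similarity or an NLI model, which these docs don't specify:

```python
def faithfulness_score(summary_sentence, source_sentences):
    """Fraction of the summary sentence's tokens covered by the
    best-matching source sentence (1.0 = fully grounded)."""
    tokens = set(summary_sentence.lower().split())
    if not tokens:
        return 1.0
    return max(
        len(tokens & set(src.lower().split())) / len(tokens)
        for src in source_sentences
    )

def flag_hallucinations(summary_sentences, source_sentences, threshold=0.5):
    """Return summary sentences whose grounding falls below the threshold."""
    return [s for s in summary_sentences
            if faithfulness_score(s, source_sentences) < threshold]
```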

Data Flow​

End-to-End Document Journey​

1. Document Upload
├── Parse PDF/DOCX/XLSX
└── Extract metadata

2. Storage
├── Raw file → S3
└── Generate pre-signed URL

3. Chunking
├── Semantic chunking
└── Create overlapping chunks

4. Embedding
├── Generate vectors
└── Batch processing

5. Indexing
├── Store in vector database
└── Index for BM25

6. Search
├── Hybrid retrieval (BM25 + Vector)
├── RRF fusion
└── Return top-K results

7. Reranking (Optional)
├── Cross-encoder scoring
└── Improved ranking

8. Summarization
├── Extract/generate summary
├── Add citations
├── Verify faithfulness
└── Return grounded summary
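The overlap in step 3 exists so that a sentence straddling a chunk boundary still appears whole in at least one chunk. A hypothetical fixed-window helper illustrates the mechanics; the real SemanticChunker splits on semantic boundaries rather than fixed character counts:

```python
def chunk_text(text, chunk_size=500, overlap=100):
    """Fixed-size character windows with overlap; each chunk shares its
    first `overlap` characters with the tail of the previous chunk."""
    chunks, start = [], 0
    step = chunk_size - overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks
```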

Profile-Based Configuration​

The Strategy Pattern​

Different use cases require different performance/quality tradeoffs. Profiles encapsulate these strategies:

# Balanced Profile (General Purpose)
pipeline = DocumentSearchPipeline.create_profile(
    ProfileType.BALANCED,
    store, chunks
)
# Config: alpha=0.5, topK=20, light reranking, extractive summary

# Latency-First Profile (Interactive)
pipeline = DocumentSearchPipeline.create_profile(
    ProfileType.LATENCY_FIRST,
    store, chunks
)
# Config: alpha=0.7, topK=10, no reranking, extractive summary

# Quality-First Profile (Research)
pipeline = DocumentSearchPipeline.create_profile(
    ProfileType.QUALITY_FIRST,
    store, chunks
)
# Config: alpha=0.5, topK=50, deep reranking, abstractive summary

Profile Comparison​

| Profile | Latency | Quality | Cost | Use Case |
|---|---|---|---|---|
| Balanced | 500ms | Good (0.7-0.8) | $0.60/1K | General Q&A |
| Latency-First | 250ms | Acceptable (0.7) | $0.35/1K | Interactive |
| Quality-First | 5s | Excellent (0.85-0.95) | $52/1K | Research |

Design Patterns​

1. Dependency Injection​

Components are injected rather than hard-coded:

class DocumentSearchPipeline:
    def __init__(self, retriever, summarizer, config, reranker=None):
        self.retriever = retriever    # Injected
        self.reranker = reranker      # Optional
        self.summarizer = summarizer  # Injected
        self.config = config          # Injected

Benefits: Easy testing, flexible configuration, loose coupling
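Because every collaborator is injected, the pipeline can be unit-tested with stubs and no running infrastructure. A sketch of that idea (the `search` method, stub classes, and result shapes here are illustrative assumptions, not the project's actual API):

```python
class DocumentSearchPipeline:
    def __init__(self, retriever, summarizer, config, reranker=None):
        self.retriever = retriever
        self.reranker = reranker
        self.summarizer = summarizer
        self.config = config

    def search(self, query, k=10):
        # Hypothetical orchestration: retrieve, optionally rerank, summarize
        docs = self.retriever.retrieve(query, k)
        if self.reranker is not None:
            docs = self.reranker.rerank(query, docs)
        return self.summarizer.summarize(query, docs)

class FakeRetriever:
    """Returns canned hits: no OpenSearch or MongoDB needed in tests."""
    def retrieve(self, query, k=10):
        return [{"id": "doc1", "text": "stub result"}]

class FakeSummarizer:
    """Echoes the top hit: no LLM call needed in tests."""
    def summarize(self, query, docs):
        return {"summary": docs[0]["text"], "citations": [docs[0]["id"]]}

pipeline = DocumentSearchPipeline(FakeRetriever(), FakeSummarizer(), config=None)
result = pipeline.search("any query")
```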

2. Factory Method​

Hide complex initialization:

@classmethod
def create_profile(cls, profile: ProfileType, vector_store, chunks):
    """Factory method for profile-based initialization."""
    if profile == ProfileType.BALANCED:
        return cls._create_balanced_profile(vector_store, chunks)
    elif profile == ProfileType.LATENCY_FIRST:
        return cls._create_latency_first_profile(vector_store, chunks)
    elif profile == ProfileType.QUALITY_FIRST:
        return cls._create_quality_first_profile(vector_store, chunks)
    raise ValueError(f"Unknown profile: {profile}")

3. Pipeline Pattern​

Break complex processes into discrete stages:

Input: Document File
           ↓
┌─────────────────────┐
│   Stage 1: Load     │
│   Parse document    │
└──────────┬──────────┘
           ↓
┌─────────────────────┐
│   Stage 2: Store    │
│   Upload to S3      │
└──────────┬──────────┘
           ↓
┌─────────────────────┐
│   Stage 3: Chunk    │
│   Split into pieces │
└──────────┬──────────┘
           ↓
┌─────────────────────┐
│   Stage 4: Embed    │
│   Generate vectors  │
└──────────┬──────────┘
           ↓
┌─────────────────────┐
│   Stage 5: Index    │
│   Store in database │
└──────────┬──────────┘
           ↓
Output: IndexingResult

Benefits: Independently testable stages, progress tracking, error isolation
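The staged structure can be expressed as a simple fold over stage functions. This is an illustrative sketch, not the project's actual orchestration code; the toy stages and state-dict shape are assumptions:

```python
def run_pipeline(state, stages):
    """Run each named stage in order, threading a state dict through
    and recording which stages completed (for progress tracking)."""
    for name, stage in stages:
        state = stage(state)
        state.setdefault("completed", []).append(name)
    return state

# Toy stages standing in for Load/Store/Chunk/Embed/Index:
stages = [
    ("load",  lambda s: {**s, "text": "raw document text"}),
    ("chunk", lambda s: {**s, "chunks": [s["text"][:8], s["text"][8:]]}),
]
result = run_pipeline({"path": "report.pdf"}, stages)
```

Because each stage is a plain function over the shared state, stages can be unit-tested in isolation and reordered or skipped per profile.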


Storage Architecture​

Two-Storage Pattern​

Different data types have different storage requirements:

Document Storage:
β”œβ”€β”€ Raw Files (S3)
β”‚ β”œβ”€β”€ Large files (100MB+)
β”‚ β”œβ”€β”€ Immutable originals
β”‚ └── Pre-signed URLs
β”‚
β”œβ”€β”€ Searchable Content (Vector Store)
β”‚ β”œβ”€β”€ Chunked and embedded
β”‚ β”œβ”€β”€ BM25 + kNN indexes
β”‚ └── Fast retrieval
β”‚
└── Metadata (MongoDB/PostgreSQL)
β”œβ”€β”€ Document info
β”œβ”€β”€ Processing status
└── Relationships

Why Two Storage Systems?​

| Need | S3 | Vector Store | Reason |
|---|---|---|---|
| Store 100MB PDF | ✅ | ❌ | Vector stores optimized for small chunks |
| Full-text search | ❌ | ✅ | S3 is object storage, not searchable |
| Keep original | ✅ | ❌ | Vector stores only keep chunks |
| Fast retrieval | ❌ | ✅ | Vector stores optimized for search |
| Pre-signed URLs | ✅ | ❌ | S3 feature for direct download |

S3 Pre-Signed URLs​

Temporary, secure URLs for direct access:

# Generate URL valid for 1 hour
url = s3_storage.generate_presigned_url(
    s3_key="documents/2025/10/doc123/report.pdf",
    expiration=3600
)

Security Benefits:

  • Time-limited (expires after N seconds)
  • No credentials exposed
  • Can't access other objects
  • Auditable via CloudTrail

Scalability Architecture​

Horizontal Scaling​

Stateless design allows multiple instances:

            Load Balancer
                  ↓
      ┌───────────┼───────────┐
      ↓           ↓           ↓
  Pipeline    Pipeline    Pipeline
 Instance 1  Instance 2  Instance 3
      ↓           ↓           ↓
      └───────────┼───────────┘
                  ↓
           Shared Storage
      ┌───────────┼───────────┐
      ↓           ↓           ↓
 OpenSearch    MongoDB       S3

Scaling Characteristics:

  • No state in pipeline instances
  • All state in shared stores
  • Add instances = more throughput
  • Each instance handles 100+ QPS

Multi-Level Caching​

Query comes in
↓
L1: Redis - Full result cache
Cache hit? → Return (5ms)
Cache miss? ↓
L2: Redis - Retrieval cache
Cache hit? → Skip to summary (100ms)
Cache miss? ↓
L3: Execute full pipeline (500ms)
↓
Cache results in L1 & L2
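The lookup order above can be sketched with plain dicts standing in for the two Redis tiers (function names and signatures are illustrative assumptions):

```python
# Plain dicts stand in for the Redis L1/L2 caches in this sketch.
l1_results, l2_retrieval = {}, {}

def cached_search(query, retrieve, summarize):
    if query in l1_results:            # L1 hit: return the full cached result
        return l1_results[query]
    if query in l2_retrieval:          # L2 hit: skip retrieval, re-summarize
        docs = l2_retrieval[query]
    else:                              # miss: run the full pipeline
        docs = retrieve(query)
        l2_retrieval[query] = docs
    result = summarize(docs)
    l1_results[query] = result         # populate L1 for the next caller
    return result
```

A production version would add TTLs and cache-key normalization (e.g. lowercasing and whitespace-collapsing the query), which Redis supports natively.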

Performance Impact:

  • 40-60% cache hit rate
  • 95% latency reduction on cache hits
  • 50-70% cost reduction

Error Handling​

Fail Gracefully Strategy​

Preserve progress through pipeline stages:

def index_document(file_path):
    result = IndexingResult(status='started')

    try:
        # Stage 1: Load
        doc = load_document(file_path)
        result.doc_loaded = True
    except Exception as e:
        result.status = 'failed'
        result.error = f"Loading failed: {e}"
        return result  # Exit early

    try:
        # Stage 2: S3 upload (non-fatal)
        s3_key = upload_to_s3(doc)
        result.s3_uploaded = True
    except Exception as e:
        log_warning(f"S3 upload failed: {e}")
        # Continue without S3

    # ... continue through stages

Key Principles:

  • Some failures are fatal (stop processing)
  • Some failures are warnings (continue)
  • Always preserve progress state
  • Return partial results when possible

Design Tradeoffs​

Performance vs Quality​

Solved through profiles:

| Profile | Performance | Quality | Cost |
|---|---|---|---|
| Latency-First | ⭐⭐⭐ | ⭐⭐ | ⭐ |
| Balanced | ⭐⭐ | ⭐⭐⭐ | ⭐ |
| Quality-First | ⭐ | ⭐⭐⭐ | ⭐⭐⭐ |

Extractive vs Abstractive​

Extractive (Default):

  • ✅ Fast (50-200ms)
  • ✅ No LLM cost
  • ✅ 100% faithful
  • ❌ Less fluent

Abstractive (Opt-in):

  • ✅ Fluent synthesis
  • ✅ Better readability
  • ❌ Slow (2-5s)
  • ❌ Expensive ($0.01-0.10)
  • ❌ Hallucination risk

Simplicity vs Flexibility​

Simple (90% use case):

pipeline = Pipeline.create_profile(ProfileType.BALANCED, ...)

Flexible (10% use case):

custom_config = PipelineConfig(
    topK=30,
    alpha=0.6,
    enable_reranking=True,
    latency_budget_ms=1000
)
pipeline = Pipeline(..., config=custom_config)

Extension Points​

Adding a New Profile​

@classmethod
def _create_custom_profile(cls, store, chunks):
    """Add your custom profile."""

    retriever = HybridRetriever(
        ...,
        alpha=0.6,
        vector_k=30
    )

    reranker = CrossEncoderReranker(
        model_name="your-model"
    )

    summarizer = GroundedSummarizer(
        mode=SummarizationMode.EXTRACTIVE,
        faithfulness_threshold=0.90
    )

    config = PipelineConfig(
        profile=ProfileType.CUSTOM,
        topK=30,
        latency_budget_ms=1000,
        ...
    )

    return cls(retriever, summarizer, config, reranker)

Adding a New Vector Store​

from packages.rag.stores import VectorStore

class YourCustomStore(VectorStore):
    def add_documents(self, documents): ...
    def search(self, query_embedding, k): ...
    def delete_documents(self, ids): ...
    def get_stats(self): ...

# Use it
pipeline = Pipeline.create_profile(
    profile=ProfileType.BALANCED,
    vector_store=YourCustomStore(...),
    chunks=chunks
)
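For tests or demos, a toy store can implement the same four methods entirely in memory. This sketch is duck-typed rather than subclassing `VectorStore` so it runs standalone; the document shape (`id` plus `embedding`) is an assumption about the interface:

```python
import math

class InMemoryStore:
    """Toy in-memory store with the same four methods as VectorStore."""
    def __init__(self):
        self.docs = {}

    def add_documents(self, documents):
        # documents: iterable of {"id": str, "embedding": [float], ...}
        for d in documents:
            self.docs[d["id"]] = d

    def search(self, query_embedding, k):
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a)) or 1.0
            nb = math.sqrt(sum(x * x for x in b)) or 1.0
            return dot / (na * nb)
        ranked = sorted(self.docs.values(),
                        key=lambda d: cos(query_embedding, d["embedding"]),
                        reverse=True)
        return ranked[:k]

    def delete_documents(self, ids):
        for i in ids:
            self.docs.pop(i, None)

    def get_stats(self):
        return {"count": len(self.docs)}
```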

Component Diagram​

DocumentSearchPipeline
├── HybridRetriever
│   ├── BM25Retriever
│   └── VectorRetriever
│       └── VectorStore
│           ├── OpenSearchStore
│           ├── MongoDBStore
│           └── AzureAISearchStore
│
├── CrossEncoderReranker (optional)
│
└── GroundedSummarizer
    ├── Extractive (sumy/TextRank)
    └── Abstractive (LangChain + OpenAI)

DocumentIndexingPipeline
├── DocumentLoader
├── SemanticChunker
├── VectorStore
└── S3DocumentStorage

Summary​

Architectural Principles​

  1. Modularity: Compose small, focused components
  2. Configuration: Profiles over code variations
  3. Strategy Pattern: Encapsulate profile configurations
  4. Dependency Injection: Loose coupling, easy testing
  5. Fail Gracefully: Preserve progress, handle errors
  6. Horizontal Scaling: Stateless design

Key Features​

  • Profile-based configuration (Balanced, Latency-First, Quality-First)
  • Hybrid retrieval (BM25 + Vector with RRF)
  • Optional cross-encoder reranking
  • Grounded summarization with citations
  • Two-tier storage (S3 + Vector Store)
  • Multi-level caching
  • Horizontal scalability

Next: Storage & Indexing | API Integration