
Architecture

System Design for Document Search & Summarization


Overview

The Document Search & Summarization system is built on a modular, composable architecture that enables flexible configuration through profiles while maintaining high performance and reliability.


Core Components

┌─────────────────────────────────────┐
│      Document Search Pipeline       │
│                                     │
│   ┌─────────────────────────────┐   │
│   │    Profile Configuration    │   │
│   │ (Balanced | Latency-First | │   │
│   │       Quality-First)        │   │
│   └──────────────┬──────────────┘   │
│                  ↓                  │
│   ┌─────────────────────────────┐   │
│   │      Hybrid Retriever       │   │
│   │ • BM25 + Vector Search      │   │
│   │ • RRF Fusion                │   │
│   └──────────────┬──────────────┘   │
│                  ↓                  │
│   ┌─────────────────────────────┐   │
│   │   Cross-Encoder Reranker    │   │
│   │         (Optional)          │   │
│   └──────────────┬──────────────┘   │
│                  ↓                  │
│   ┌─────────────────────────────┐   │
│   │     Grounded Summarizer     │   │
│   │ • Extractive (TextRank)     │   │
│   │ • Abstractive (LLM)         │   │
│   │ • Citations                 │   │
│   └─────────────────────────────┘   │
└─────────────────────────────────────┘

Architectural Layers

Layer 1: Storage & Indexing

┌──────────────────────────────────────┐
│            Vector Stores             │
│   ┌──────────┐      ┌──────────┐     │
│   │OpenSearch│      │ MongoDB  │     │
│   │  (BM25+  │      │ (Vector) │     │
│   │   kNN)   │      │          │     │
│   └──────────┘      └──────────┘     │
└──────────────────────────────────────┘

All stores implement a common VectorStore interface, so back ends are interchangeable.

Storage Strategy:

  • Vector Stores: OpenSearch, MongoDB Atlas, Azure AI Search
  • S3: Raw document storage with pre-signed URLs
  • Two-tier pattern: S3 for originals, vector stores for search

Layer 2: Retrieval & Ranking

┌──────────────────────────────────────┐
│          Retrieval Pipeline          │
│                                      │
│   ┌──────────┐      ┌────────────┐   │
│   │   BM25   │      │   Vector   │   │
│   │Retriever │      │ Retriever  │   │
│   └────┬─────┘      └─────┬──────┘   │
│        └────────┬─────────┘          │
│             RRF Fusion               │
│                 ↓                    │
│           HybridRetriever            │
└─────────────────┬────────────────────┘
                  ↓
          ┌───────────────┐
          │   Reranker    │
          │  (Cross-Enc)  │
          └───────────────┘

Key Features:

  • BM25Retriever: Keyword search with tunable k1, b parameters
  • VectorRetriever: Semantic similarity search
  • HybridRetriever: Reciprocal Rank Fusion (RRF) with α-weighting
  • CrossEncoderReranker: Deep reranking with circuit breaker protection
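The fusion step can be sketched in a few lines. The exact α-weighting inside HybridRetriever is not shown in this document, so treat `alpha` and the damping constant `k` below as illustrative defaults, not the shipped values:

```python
def rrf_fuse(bm25_ranking, vector_ranking, alpha=0.5, k=60):
    """Fuse two ranked lists of doc IDs with Reciprocal Rank Fusion.

    alpha weights the vector ranking against BM25; k=60 is the
    conventional RRF damping constant.
    """
    scores = {}
    for rank, doc_id in enumerate(bm25_ranking):
        scores[doc_id] = scores.get(doc_id, 0.0) + (1 - alpha) / (k + rank + 1)
    for rank, doc_id in enumerate(vector_ranking):
        scores[doc_id] = scores.get(doc_id, 0.0) + alpha / (k + rank + 1)
    # Highest fused score first; docs found by both retrievers rise to the top
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf_fuse(["a", "b", "c"], ["b", "d", "a"])
```

Because RRF works on ranks rather than raw scores, BM25 and cosine similarities never need to be calibrated against each other.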

Layer 3: Summarization & Orchestration

GroundedSummarizer provides two modes:

Extractive (TextRank):

  • Fast (50-200ms)
  • No LLM cost
  • 100% faithful to source
  • Sentence-level citations
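The extractive mode can be approximated with word-overlap centrality, a TextRank-style heuristic. The production path uses sumy's TextRank; this sketch only illustrates the idea and the sentence-level granularity that makes citations cheap:

```python
import re

def extractive_summary(text, n=1):
    """Rank sentences by total word overlap with the rest of the text
    (a crude stand-in for TextRank) and return the top n sentences
    in their original order, so citations map back to the source."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = [set(re.findall(r"\w+", s.lower())) for s in sentences]
    # A sentence's score is its overlap with every other sentence
    scores = [
        sum(len(w & other) for j, other in enumerate(words) if j != i)
        for i, w in enumerate(words)
    ]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]
```

Because the output is a verbatim source sentence, faithfulness is guaranteed by construction.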

Abstractive (LLM):

  • Fluent synthesis
  • 2-5s latency
  • Includes faithfulness verification
  • Hallucination detection
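The verification step can be approximated with a lexical-support score. Real verifiers typically use NLI models or an LLM judge, so treat this as a stand-in that only shows the shape of the check:

```python
import re

def faithfulness_score(summary, source_chunks, threshold=0.9):
    """Crude hallucination check: the fraction of summary words that
    appear anywhere in the retrieved source chunks. Returns the score
    and whether it clears the faithfulness threshold."""
    source_words = set(re.findall(r"\w+", " ".join(source_chunks).lower()))
    summary_words = re.findall(r"\w+", summary.lower())
    if not summary_words:
        return 1.0, True
    supported = sum(w in source_words for w in summary_words)
    score = supported / len(summary_words)
    return score, score >= threshold
```

A summary that fails the threshold can be regenerated or downgraded to the extractive path.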

Data Flow

End-to-End Document Journey

1. Document Upload
├── Parse PDF/DOCX/XLSX
└── Extract metadata

2. Storage
├── Raw file → S3
└── Generate pre-signed URL

3. Chunking
├── Semantic chunking
└── Create overlapping chunks

4. Embedding
├── Generate vectors
└── Batch processing

5. Indexing
├── Store in vector database
└── Index for BM25

6. Search
├── Hybrid retrieval (BM25 + Vector)
├── RRF fusion
└── Return top-K results

7. Reranking (Optional)
├── Cross-encoder scoring
└── Improved ranking

8. Summarization
├── Extract/generate summary
├── Add citations
├── Verify faithfulness
└── Return grounded summary
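Steps 3-5 hinge on overlapping chunks. A character-window sketch shows the overlap mechanics; real semantic chunking splits on sentence and section boundaries instead of fixed offsets:

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into fixed-size chunks where each chunk repeats the
    last `overlap` characters of the previous one, so no context is
    lost at chunk boundaries."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]
```

The overlap means a sentence that straddles a boundary still appears intact in at least one chunk, at the cost of indexing some text twice.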

Profile-Based Configuration

The Strategy Pattern

Different use cases require different performance/quality tradeoffs. Profiles encapsulate these strategies:

# Balanced Profile (General Purpose)
pipeline = DocumentSearchPipeline.create_profile(
    ProfileType.BALANCED,
    store, chunks
)
# Config: alpha=0.5, topK=20, light reranking, extractive summary

# Latency-First Profile (Interactive)
pipeline = DocumentSearchPipeline.create_profile(
    ProfileType.LATENCY_FIRST,
    store, chunks
)
# Config: alpha=0.7, topK=10, no reranking, extractive summary

# Quality-First Profile (Research)
pipeline = DocumentSearchPipeline.create_profile(
    ProfileType.QUALITY_FIRST,
    store, chunks
)
# Config: alpha=0.5, topK=50, deep reranking, abstractive summary

Profile Comparison

Profile         Latency   Quality                Cost       Use Case
Balanced        500ms     Good (0.7-0.8)         $0.60/1K   General Q&A
Latency-First   250ms     Acceptable (0.7)       $0.35/1K   Interactive
Quality-First   5s        Excellent (0.85-0.95)  $52/1K     Research

Design Patterns

1. Dependency Injection

Components are injected rather than hard-coded:

class DocumentSearchPipeline:
    def __init__(self, retriever, summarizer, config, reranker=None):
        self.retriever = retriever    # Injected
        self.reranker = reranker      # Optional
        self.summarizer = summarizer  # Injected
        self.config = config          # Injected

Benefits: Easy testing, flexible configuration, loose coupling

2. Factory Method

Hide complex initialization:

@classmethod
def create_profile(cls, profile: ProfileType, vector_store, chunks):
    """Factory method for profile-based initialization."""
    if profile == ProfileType.BALANCED:
        return cls._create_balanced_profile(vector_store, chunks)
    elif profile == ProfileType.LATENCY_FIRST:
        return cls._create_latency_first_profile(vector_store, chunks)
    elif profile == ProfileType.QUALITY_FIRST:
        return cls._create_quality_first_profile(vector_store, chunks)
    raise ValueError(f"Unknown profile: {profile}")

3. Pipeline Pattern

Break complex processes into discrete stages:

Input: Document File

           ↓
┌─────────────────────┐
│    Stage 1: Load    │
│   Parse document    │
└──────────┬──────────┘
           ↓
┌─────────────────────┐
│   Stage 2: Store    │
│    Upload to S3     │
└──────────┬──────────┘
           ↓
┌─────────────────────┐
│   Stage 3: Chunk    │
│  Split into pieces  │
└──────────┬──────────┘
           ↓
┌─────────────────────┐
│   Stage 4: Embed    │
│  Generate vectors   │
└──────────┬──────────┘
           ↓
┌─────────────────────┐
│   Stage 5: Index    │
│  Store in database  │
└──────────┬──────────┘
           ↓
Output: IndexingResult

Benefits: Independently testable stages, progress tracking, error isolation
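A minimal version of the pattern, with hypothetical stage names and toy stage functions, shows how progress tracking and error isolation fall out of the structure:

```python
def run_pipeline(stages, data):
    """Run named stages in order, threading each stage's output into
    the next and recording progress so a failure can be attributed to
    exactly one stage."""
    completed = []
    for name, stage in stages:
        try:
            data = stage(data)
        except Exception as e:
            # Error isolation: report which stage broke and what survived
            return {"status": "failed", "stage": name,
                    "completed": completed, "error": str(e)}
        completed.append(name)
    return {"status": "ok", "completed": completed, "result": data}

# Toy stages standing in for Load/Chunk/Index
stages = [
    ("load", str.strip),
    ("chunk", lambda text: text.split()),
    ("index", len),
]
```

Each stage is an ordinary callable, so every stage can be unit-tested in isolation with plain inputs.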


Storage Architecture

Two-Storage Pattern

Different data types have different storage requirements:

Document Storage:
├── Raw Files (S3)
│   ├── Large files (100MB+)
│   ├── Immutable originals
│   └── Pre-signed URLs
├── Searchable Content (Vector Store)
│   ├── Chunked and embedded
│   ├── BM25 + kNN indexes
│   └── Fast retrieval
└── Metadata (MongoDB/PostgreSQL)
    ├── Document info
    ├── Processing status
    └── Relationships

Why Two Storage Systems?

Need               S3   Vector Store   Reason
Store 100MB PDF    ✅   ❌             Vector stores optimized for small chunks
Full-text search   ❌   ✅             S3 is object storage, not searchable
Keep original      ✅   ❌             Vector stores only keep chunks
Fast retrieval     ❌   ✅             Vector stores optimized for search
Pre-signed URLs    ✅   ❌             S3 feature for direct download

S3 Pre-Signed URLs

Temporary, secure URLs for direct access:

# Generate URL valid for 1 hour
url = s3_storage.generate_presigned_url(
    s3_key="documents/2025/10/doc123/report.pdf",
    expiration=3600
)

Security Benefits:

  • Time-limited (expires after N seconds)
  • No credentials exposed
  • Can't access other objects
  • Auditable via CloudTrail

Scalability Architecture

Horizontal Scaling

Stateless design allows multiple instances:

                Load Balancer
                      │
     ┌────────────────┼────────────────┐
     ↓                ↓                ↓
  Pipeline         Pipeline         Pipeline
 Instance 1       Instance 2       Instance 3
     ↓                ↓                ↓
     └────────────────┼────────────────┘
                      │
                Shared Storage
     ┌────────────────┼────────────────┐
     ↓                ↓                ↓
 OpenSearch         MongoDB           S3

Scaling Characteristics:

  • No state in pipeline instances
  • All state in shared stores
  • Add instances = more throughput
  • Each instance handles 100+ QPS

Multi-Level Caching

Query comes in

L1: Redis - Full result cache
Cache hit? → Return (5ms)
Cache miss? ↓
L2: Redis - Retrieval cache
Cache hit? → Skip to summary (100ms)
Cache miss? ↓
L3: Execute full pipeline (500ms)

Cache results in L1 & L2
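The lookup order above can be sketched with in-memory dicts standing in for the two Redis tiers; key hashing is real, but TTLs, eviction, and invalidation are omitted:

```python
import hashlib

# In-memory dicts stand in for the two Redis tiers
l1_results, l2_retrieval = {}, {}

def cache_key(query):
    """Stable key for a query string."""
    return hashlib.sha256(query.encode()).hexdigest()

def search(query, retrieve, summarize):
    """L1 hit returns the full cached result; L2 hit skips retrieval
    and only re-runs summarization; a full miss runs the whole
    pipeline and populates both tiers."""
    key = cache_key(query)
    if key in l1_results:            # L1: full result (~5ms)
        return l1_results[key]
    if key in l2_retrieval:          # L2: cached retrieval (~100ms)
        docs = l2_retrieval[key]
    else:                            # L3: full pipeline (~500ms)
        docs = retrieve(query)
        l2_retrieval[key] = docs
    result = summarize(docs)
    l1_results[key] = result
    return result
```

The two tiers pay off separately: L2 stays warm even when summarization parameters change and L1 entries are invalidated.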

Performance Impact:

  • 40-60% cache hit rate
  • 95% latency reduction on cache hits
  • 50-70% cost reduction

Error Handling

Fail Gracefully Strategy

Preserve progress through pipeline stages:

def index_document(file_path):
    result = IndexingResult(status='started')

    try:
        # Stage 1: Load
        doc = load_document(file_path)
        result.doc_loaded = True
    except Exception as e:
        result.status = 'failed'
        result.error = f"Loading failed: {e}"
        return result  # Exit early

    try:
        # Stage 2: S3 upload (non-fatal)
        s3_key = upload_to_s3(doc)
        result.s3_uploaded = True
    except Exception as e:
        log_warning(f"S3 upload failed: {e}")
        # Continue without S3

    # ... continue through stages

Key Principles:

  • Some failures are fatal (stop processing)
  • Some failures are warnings (continue)
  • Always preserve progress state
  • Return partial results when possible

Design Tradeoffs

Performance vs Quality

Solved through profiles:

Profile         Performance   Quality   Cost
Latency-First   ⭐⭐⭐         ⭐         ⭐
Balanced        ⭐⭐           ⭐⭐        ⭐⭐
Quality-First   ⭐             ⭐⭐⭐      ⭐⭐

Extractive vs Abstractive

Extractive (Default):

  • ✅ Fast (50-200ms)
  • ✅ No LLM cost
  • ✅ 100% faithful
  • ❌ Less fluent

Abstractive (Opt-in):

  • ✅ Fluent synthesis
  • ✅ Better readability
  • ❌ Slow (2-5s)
  • ❌ Expensive ($0.01-0.10 per query)
  • ❌ Hallucination risk

Simplicity vs Flexibility

Simple (90% use case):

pipeline = Pipeline.create_profile(ProfileType.BALANCED, ...)

Flexible (10% use case):

custom_config = PipelineConfig(
    topK=30,
    alpha=0.6,
    enable_reranking=True,
    latency_budget_ms=1000
)
pipeline = Pipeline(..., config=custom_config)

Extension Points

Adding a New Profile

@classmethod
def _create_custom_profile(cls, store, chunks):
    """Add your custom profile."""

    retriever = HybridRetriever(
        ...,
        alpha=0.6,
        vector_k=30
    )

    reranker = CrossEncoderReranker(
        model_name="your-model"
    )

    summarizer = GroundedSummarizer(
        mode=SummarizationMode.EXTRACTIVE,
        faithfulness_threshold=0.90
    )

    config = PipelineConfig(
        profile=ProfileType.CUSTOM,
        topK=30,
        latency_budget_ms=1000,
        ...
    )

    return cls(retriever, summarizer, config, reranker)

Adding a New Vector Store

from packages.rag.stores import VectorStore

class YourCustomStore(VectorStore):
    def add_documents(self, documents): ...
    def search(self, query_embedding, k): ...
    def delete_documents(self, ids): ...
    def get_stats(self): ...

# Use it
pipeline = Pipeline.create_profile(
    profile=ProfileType.BALANCED,
    vector_store=YourCustomStore(...),
    chunks=chunks
)

Component Diagram

DocumentSearchPipeline
├── HybridRetriever
│   ├── BM25Retriever
│   └── VectorRetriever
│       └── VectorStore
│           ├── OpenSearchStore
│           ├── MongoDBStore
│           └── AzureAISearchStore
├── CrossEncoderReranker (optional)
└── GroundedSummarizer
    ├── Extractive (sumy/TextRank)
    └── Abstractive (LangChain + OpenAI)

DocumentIndexingPipeline
├── DocumentLoader
├── SemanticChunker
├── VectorStore
└── S3DocumentStorage

Summary

Architectural Principles

  1. Modularity: Compose small, focused components
  2. Configuration: Profiles over code variations
  3. Strategy Pattern: Encapsulate profile configurations
  4. Dependency Injection: Loose coupling, easy testing
  5. Fail Gracefully: Preserve progress, handle errors
  6. Horizontal Scaling: Stateless design

Key Features

  • Profile-based configuration (Balanced, Latency-First, Quality-First)
  • Hybrid retrieval (BM25 + Vector with RRF)
  • Optional cross-encoder reranking
  • Grounded summarization with citations
  • Two-tier storage (S3 + Vector Store)
  • Multi-level caching
  • Horizontal scalability

Next: Storage & Indexing | API Integration