Architecture
System Design for Document Search & Summarization
Overview
The Document Search & Summarization system is built on a modular, composable architecture that enables flexible configuration through profiles while maintaining high performance and reliability.
Core Components
┌─────────────────────────────────────┐
│ Document Search Pipeline │
│ │
│ ┌──────────────────────────────┐ │
│ │ Profile Configuration │ │
│ │ (Balanced | Latency | │ │
│ │ Quality) │ │
│ └──────────┬───────────────────┘ │
│ ↓ │
│ ┌──────────────────────────────┐ │
│ │ Hybrid Retriever │ │
│ │ • BM25 + Vector Search │ │
│ │ • RRF Fusion │ │
│ └──────────┬───────────────────┘ │
│ ↓ │
│ ┌──────────────────────────────┐ │
│ │ Cross-Encoder Reranker │ │
│ │ (Optional) │ │
│ └──────────┬───────────────────┘ │
│ ↓ │
│ ┌──────────────────────────────┐ │
│ │ Grounded Summarizer │ │
│ │ • Extractive (TextRank) │ │
│ │ • Abstractive (LLM) │ │
│ │ • Citations │ │
│ └──────────────────────────────┘ │
└─────────────────────────────────────┘
Architectural Layers
Layer 1: Storage & Indexing
┌──────────────────────────────────────┐
│ Vector Stores │
│ ┌──────────┐ ┌──────────┐ │
│ │OpenSearch│ │ MongoDB │ │
│ │ (BM25+ │ │ (Vector) │ │
│ │ kNN) │ │ │ │
│ └──────────┘ └──────────┘ │
└──────────────────────────────────────┘
↑
VectorStore interface
Storage Strategy:
- Vector Stores: OpenSearch, MongoDB Atlas, Azure AI Search
- S3: Raw document storage with pre-signed URLs
- Two-tier pattern: S3 for originals, vector stores for search
Layer 2: Retrieval & Ranking
┌──────────────────────────────────────┐
│ Retrieval Pipeline │
│ │
│ ┌──────────┐ ┌────────────┐ │
│ │ BM25 │ │ Vector │ │
│ │Retriever │ │ Retriever │ │
│ └────┬─────┘ └─────┬──────┘ │
│ └────────┬───────┘ │
│ RRF Fusion │
│ ↓ │
│ HybridRetriever │
└──────────────────┬───────────────────┘
↓
┌───────────────┐
│ Reranker │
│ (Cross-Enc) │
└───────────────┘
Key Features:
- BM25Retriever: Keyword search with tunable k1, b parameters
- VectorRetriever: Semantic similarity search
- HybridRetriever: Reciprocal Rank Fusion (RRF) with α-weighting
- CrossEncoderReranker: Deep reranking with circuit breaker protection
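The RRF fusion step can be illustrated with a small sketch. This is a simplified stand-in for the actual HybridRetriever (function and variable names here are illustrative): each document scores `weight / (k + rank)` per list, with `alpha` weighting the vector list against the BM25 list.

```python
def rrf_fuse(bm25_ranked, vector_ranked, alpha=0.5, k=60):
    """Reciprocal Rank Fusion: score = weight / (k + rank).

    `alpha` weights the vector list; (1 - alpha) weights BM25.
    `k` dampens the influence of top ranks (60 is a common default).
    """
    scores = {}
    for rank, doc_id in enumerate(bm25_ranked, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + (1 - alpha) / (k + rank)
    for rank, doc_id in enumerate(vector_ranked, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + alpha / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Documents that rank high in both lists win:
fused = rrf_fuse(["d1", "d2", "d3"], ["d2", "d4", "d1"], alpha=0.5)
# → ["d2", "d1", "d4", "d3"]
```

Rank-based fusion is what lets BM25 scores and cosine similarities combine without any score normalization, which is why RRF is a robust default here.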
Layer 3: Summarization & Orchestration
GroundedSummarizer provides two modes:
Extractive (TextRank):
- Fast (50-200ms)
- No LLM cost
- 100% faithful to source
- Sentence-level citations
Abstractive (LLM):
- Fluent synthesis
- 2-5s latency
- Includes faithfulness verification
- Hallucination detection
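The extractive mode can be sketched in a few lines. The real GroundedSummarizer uses sumy's TextRank; this simplified version scores each sentence by its total word overlap with every other sentence (the degree-centrality approximation of the full PageRank iteration) and keeps sentence indices for citations.

```python
import re

def extractive_summary(text, n=2):
    """Pick the n most central sentences (simplified TextRank-style scoring).

    Returns (index, sentence) pairs in document order, so each selected
    sentence can be cited back to its position in the source.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = [set(re.findall(r"\w+", s.lower())) for s in sentences]
    scores = [
        sum(len(w & other) for j, other in enumerate(words) if j != i)
        for i, w in enumerate(words)
    ]
    top = sorted(sorted(range(len(sentences)), key=lambda i: -scores[i])[:n])
    return [(i, sentences[i]) for i in top]
```

Because the output is verbatim source sentences, faithfulness is guaranteed by construction; the abstractive mode needs a separate verification pass precisely because it loses that property.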
Data Flow
End-to-End Document Journey
1. Document Upload
├── Parse PDF/DOCX/XLSX
└── Extract metadata
2. Storage
├── Raw file → S3
└── Generate pre-signed URL
3. Chunking
├── Semantic chunking
└── Create overlapping chunks
4. Embedding
├── Generate vectors
└── Batch processing
5. Indexing
├── Store in vector database
└── Index for BM25
6. Search
├── Hybrid retrieval (BM25 + Vector)
├── RRF fusion
└── Return top-K results
7. Reranking (Optional)
├── Cross-encoder scoring
└── Improved ranking
8. Summarization
├── Extract/generate summary
├── Add citations
├── Verify faithfulness
└── Return grounded summary
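Step 3 (chunking) is the least obvious stage in the journey above, so here is a sketch of the overlap idea. The real pipeline uses a SemanticChunker that respects sentence boundaries; this character-window version (names illustrative) only shows why chunks overlap: each chunk repeats the tail of the previous one so no statement loses its context at a boundary.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into fixed-size, overlapping chunks.

    The window advances by (chunk_size - overlap), so consecutive
    chunks share `overlap` characters of context.
    """
    step = chunk_size - overlap
    chunks = []
    for start in range(0, max(len(text) - overlap, 1), step):
        chunks.append(text[start:start + chunk_size])
    return chunks

chunks = chunk_text("".join(chr(65 + i % 26) for i in range(500)),
                    chunk_size=200, overlap=50)
# → 3 chunks of 200 chars; chunk boundaries share 50 chars
```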
Profile-Based Configuration
The Strategy Pattern
Different use cases require different performance/quality tradeoffs. Profiles encapsulate these strategies:
# Balanced Profile (General Purpose)
pipeline = DocumentSearchPipeline.create_profile(
    ProfileType.BALANCED,
    store, chunks
)
# Config: alpha=0.5, topK=20, light reranking, extractive summary

# Latency-First Profile (Interactive)
pipeline = DocumentSearchPipeline.create_profile(
    ProfileType.LATENCY_FIRST,
    store, chunks
)
# Config: alpha=0.7, topK=10, no reranking, extractive summary

# Quality-First Profile (Research)
pipeline = DocumentSearchPipeline.create_profile(
    ProfileType.QUALITY_FIRST,
    store, chunks
)
# Config: alpha=0.5, topK=50, deep reranking, abstractive summary
Profile Comparison
| Profile | Latency | Quality | Cost | Use Case |
|---|---|---|---|---|
| Balanced | 500ms | Good (0.7-0.8) | $0.60/1K | General Q&A |
| Latency-First | 250ms | Acceptable (0.7) | $0.35/1K | Interactive |
| Quality-First | 5s | Excellent (0.85-0.95) | $52/1K | Research |
Design Patterns
1. Dependency Injection
Components are injected rather than hard-coded:
class DocumentSearchPipeline:
    def __init__(self, retriever, summarizer, config, reranker=None):
        self.retriever = retriever    # Injected
        self.reranker = reranker      # Optional
        self.summarizer = summarizer  # Injected
        self.config = config          # Injected
Benefits: Easy testing, flexible configuration, loose coupling
2. Factory Method
Hide complex initialization:
@classmethod
def create_profile(cls, profile: ProfileType, vector_store, chunks):
    """Factory method for profile-based initialization."""
    if profile == ProfileType.BALANCED:
        return cls._create_balanced_profile(vector_store, chunks)
    elif profile == ProfileType.LATENCY_FIRST:
        return cls._create_latency_first_profile(vector_store, chunks)
    elif profile == ProfileType.QUALITY_FIRST:
        return cls._create_quality_first_profile(vector_store, chunks)
3. Pipeline Pattern
Break complex processes into discrete stages:
Input: Document File
↓
┌─────────────────────┐
│ Stage 1: Load │
│ Parse document │
└──────────┬──────────┘
↓
┌─────────────────────┐
│ Stage 2: Store │
│ Upload to S3 │
└──────────┬──────────┘
↓
┌─────────────────────┐
│ Stage 3: Chunk │
│ Split into pieces │
└──────────┬──────────┘
↓
┌─────────────────────┐
│ Stage 4: Embed │
│ Generate vectors │
└──────────┬──────────┘
↓
┌─────────────────────┐
│ Stage 5: Index │
│ Store in database │
└──────────┬──────────┘
↓
Output: IndexingResult
Benefits: Independently testable stages, progress tracking, error isolation
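The pipeline pattern above can be sketched as a list of named stage functions run in order. This is an illustrative sketch, not the actual DocumentIndexingPipeline: the completed-stage list is what gives progress tracking and error isolation (on failure, the last completed stage is known).

```python
def run_pipeline(document, stages):
    """Run named stages in order, recording progress as each completes.

    Each stage is a plain function, so every stage is independently
    testable with a stubbed input.
    """
    completed = []
    for name, stage in stages:
        document = stage(document)
        completed.append(name)
    return document, completed

result, done = run_pipeline(
    "raw",
    [
        ("load", lambda d: d + ":loaded"),
        ("chunk", lambda d: d + ":chunked"),
        ("index", lambda d: d + ":indexed"),
    ],
)
# → result == "raw:loaded:chunked:indexed", done == ["load", "chunk", "index"]
```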
Storage Architecture
Two-Storage Pattern
Different data types have different storage requirements:
Document Storage:
├── Raw Files (S3)
│ ├── Large files (100MB+)
│ ├── Immutable originals
│ └── Pre-signed URLs
│
├── Searchable Content (Vector Store)
│ ├── Chunked and embedded
│ ├── BM25 + kNN indexes
│ └── Fast retrieval
│
└── Metadata (MongoDB/PostgreSQL)
├── Document info
├── Processing status
└── Relationships
Why Two Storage Systems?
| Need | S3 | Vector Store | Reason |
|---|---|---|---|
| Store 100MB PDF | ✅ | ❌ | Vector stores optimized for small chunks |
| Full-text search | ❌ | ✅ | S3 is object storage, not searchable |
| Keep original | ✅ | ❌ | Vector stores only keep chunks |
| Fast retrieval | ❌ | ✅ | Vector stores optimized for search |
| Pre-signed URLs | ✅ | ❌ | S3 feature for direct download |
S3 Pre-Signed URLs
Temporary, secure URLs for direct access:
# Generate URL valid for 1 hour
url = s3_storage.generate_presigned_url(
    s3_key="documents/2025/10/doc123/report.pdf",
    expiration=3600
)
Security Benefits:
- Time-limited (expires after N seconds)
- No credentials exposed
- Can't access other objects
- Auditable via CloudTrail
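The real implementation delegates signing to S3, but the mechanism behind a pre-signed URL can be sketched with stdlib HMAC: the URL carries an expiry timestamp and a signature over (key, expiry), so the server can verify it later without ever exposing the signing secret. All names and the URL scheme here are illustrative, not S3's actual signature format.

```python
import hmac, hashlib, time

SECRET = b"server-side-signing-key"  # never leaves the server

def presign(s3_key, expiration=3600, now=None):
    """Build a time-limited signed URL (conceptual sketch of pre-signing)."""
    expires = int(now if now is not None else time.time()) + expiration
    payload = f"{s3_key}:{expires}".encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return f"https://bucket.example/{s3_key}?expires={expires}&sig={sig}"

def verify(url, now=None):
    """Reject tampered or expired URLs: this is why the key path is
    bound into the signature (can't be swapped for another object)."""
    path, query = url.split("?")
    s3_key = path.removeprefix("https://bucket.example/")
    params = dict(p.split("=") for p in query.split("&"))
    payload = f"{s3_key}:{params['expires']}".encode()
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    fresh = int(params["expires"]) > (now if now is not None else time.time())
    return fresh and hmac.compare_digest(expected, params["sig"])

url = presign("documents/2025/10/doc123/report.pdf", expiration=3600, now=1000)
```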
Scalability Architecture
Horizontal Scaling
Stateless design allows multiple instances:
Load Balancer
↓
┌──────────┼──────────┐
↓ ↓ ↓
Pipeline Pipeline Pipeline
Instance 1 Instance 2 Instance 3
↓ ↓ ↓
└──────────┼──────────┘
↓
Shared Storage
┌──────────┼──────────┐
↓ ↓ ↓
OpenSearch MongoDB S3
Scaling Characteristics:
- No state in pipeline instances
- All state in shared stores
- Add instances = more throughput
- Each instance handles 100+ QPS
Multi-Level Caching
Query comes in
↓
L1: Redis - Full result cache
Cache hit? → Return (5ms)
Cache miss? ↓
L2: Redis - Retrieval cache
Cache hit? → Skip to summary (100ms)
Cache miss? ↓
L3: Execute full pipeline (500ms)
↓
Cache results in L1 & L2
Performance Impact:
- 40-60% cache hit rate
- 95% latency reduction on cache hits
- 50-70% cost reduction
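The L1/L2 lookup order can be sketched with plain dicts standing in for Redis; `retrieve` and `summarize` represent the expensive pipeline stages from the diagram above (all names illustrative).

```python
l1_results = {}    # full-result cache (Redis in production)
l2_retrieval = {}  # retrieval-only cache (Redis in production)

def search(query, retrieve, summarize):
    """Look up L1, then L2, then fall through to the full pipeline."""
    if query in l1_results:            # L1 hit: fast path, nothing recomputed
        return l1_results[query]
    if query in l2_retrieval:          # L2 hit: skip retrieval, re-summarize
        docs = l2_retrieval[query]
    else:                              # full miss: run the whole pipeline
        docs = retrieve(query)
        l2_retrieval[query] = docs
    answer = summarize(docs)
    l1_results[query] = answer         # populate both levels for next time
    return answer

calls = []
answer = search("q", lambda q: calls.append("retrieve") or ["doc"], lambda d: "summary")
again = search("q", lambda q: calls.append("retrieve") or ["doc"], lambda d: "summary")
# → second call is an L1 hit; retrieval ran exactly once
```

Keeping L2 separate from L1 pays off when summarization parameters change but the query does not: the retrieval work survives an L1 invalidation.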
Error Handling
Fail Gracefully Strategy
Preserve progress through pipeline stages:
def index_document(file_path):
    result = IndexingResult(status='started')
    try:
        # Stage 1: Load
        doc = load_document(file_path)
        result.doc_loaded = True
    except Exception as e:
        result.status = 'failed'
        result.error = f"Loading failed: {e}"
        return result  # Exit early

    try:
        # Stage 2: S3 (non-fatal)
        s3_key = upload_to_s3(doc)
        result.s3_uploaded = True
    except Exception as e:
        log_warning(f"S3 upload failed: {e}")
        # Continue without S3

    # ... continue through stages
Key Principles:
- Some failures are fatal (stop processing)
- Some failures are warnings (continue)
- Always preserve progress state
- Return partial results when possible
Design Tradeoffs
Performance vs Quality
Solved through profiles:
| Profile | Performance | Quality | Cost |
|---|---|---|---|
| Latency-First | ⭐⭐⭐ | ⭐⭐ | ⭐ |
| Balanced | ⭐⭐ | ⭐⭐ | ⭐⭐ |
| Quality-First | ⭐ | ⭐⭐⭐ | ⭐⭐⭐ |
Extractive vs Abstractive
Extractive (Default):
- ✅ Fast (50-200ms)
- ✅ No LLM cost
- ✅ 100% faithful
- ❌ Less fluent
Abstractive (Opt-in):
- ✅ Fluent synthesis
- ✅ Better readability
- ❌ Slow (2-5s)
- ❌ Expensive ($0.01-0.10)
- ❌ Hallucination risk
Simplicity vs Flexibility
Simple (90% use case):
pipeline = Pipeline.create_profile(ProfileType.BALANCED, ...)
Flexible (10% use case):
custom_config = PipelineConfig(
    topK=30,
    alpha=0.6,
    enable_reranking=True,
    latency_budget_ms=1000
)
pipeline = Pipeline(..., config=custom_config)
Extension Points
Adding a New Profile
@classmethod
def _create_custom_profile(cls, store, chunks):
    """Add your custom profile."""
    retriever = HybridRetriever(
        ...,
        alpha=0.6,
        vector_k=30
    )
    reranker = CrossEncoderReranker(
        model_name="your-model"
    )
    summarizer = GroundedSummarizer(
        mode=SummarizationMode.EXTRACTIVE,
        faithfulness_threshold=0.90
    )
    config = PipelineConfig(
        profile=ProfileType.CUSTOM,
        topK=30,
        latency_budget_ms=1000,
        ...
    )
    return cls(retriever, summarizer, config, reranker)
Adding a New Vector Store
from packages.rag.stores import VectorStore

class YourCustomStore(VectorStore):
    def add_documents(self, documents): ...
    def search(self, query_embedding, k): ...
    def delete_documents(self, ids): ...
    def get_stats(self): ...

# Use it
pipeline = Pipeline.create_profile(
    profile=ProfileType.BALANCED,
    vector_store=YourCustomStore(...),
    chunks=chunks
)
Component Diagram
DocumentSearchPipeline
├── HybridRetriever
│ ├── BM25Retriever
│ └── VectorRetriever
│ └── VectorStore
│ ├── OpenSearchStore
│ ├── MongoDBStore
│ └── AzureAISearchStore
│
├── CrossEncoderReranker (optional)
│
└── GroundedSummarizer
├── Extractive (sumy/TextRank)
└── Abstractive (LangChain + OpenAI)
DocumentIndexingPipeline
├── DocumentLoader
├── SemanticChunker
├── VectorStore
└── S3DocumentStorage
Summary
Architectural Principles
- Modularity: Compose small, focused components
- Configuration: Profiles over code variations
- Strategy Pattern: Encapsulate profile configurations
- Dependency Injection: Loose coupling, easy testing
- Fail Gracefully: Preserve progress, handle errors
- Horizontal Scaling: Stateless design
Key Features
- Profile-based configuration (Balanced, Latency-First, Quality-First)
- Hybrid retrieval (BM25 + Vector with RRF)
- Optional cross-encoder reranking
- Grounded summarization with citations
- Two-tier storage (S3 + Vector Store)
- Multi-level caching
- Horizontal scalability
Next: Storage & Indexing | API Integration