
Document Search & Summarization: A Comprehensive Guide

A deep dive into hybrid retrieval, grounded summarization, and production-ready document intelligence


Overview

Document Search & Summarization is a sophisticated service that combines cutting-edge information retrieval techniques with grounded text summarization to provide accurate, cited, and trustworthy answers from document collections. This guide explores the theoretical foundations, architectural decisions, and practical implementation of a production-ready system.

What You'll Learn

  • Information Retrieval Theory: Understanding BM25, vector embeddings, and hybrid search
  • Summarization Techniques: Extractive vs. abstractive, grounding, and citation management
  • Profile-Based Architecture: Designing for different SLOs (latency, quality, cost)
  • Production Patterns: Fail-closed design, faithfulness verification, and evaluation
  • Implementation: Building and deploying a complete system

Table of Contents

  1. Theoretical Foundations
  2. Architecture & Design
  3. Information Retrieval Deep Dive
  4. Summarization Techniques
  5. Profile-Based Configuration
  6. Implementation Guide
  7. Evaluation & Quality Assurance
  8. Production Best Practices
  9. Advanced Topics

Theoretical Foundations

The Information Retrieval Problem

Core Challenge: Given a large collection of documents and a user query, how do we find the most relevant information quickly and accurately?

This is a classical problem in computer science with applications in search engines, question answering, and knowledge management. The challenge has multiple dimensions:

  1. Relevance: How well does a document answer the query?
  2. Recall: Are we finding all relevant documents?
  3. Precision: Are the results actually relevant?
  4. Latency: How quickly can we retrieve results?
  5. Scalability: Can the system handle millions of documents?

1. Keyword Search (1970s-2000s)

Approach: Match exact words between query and documents.

Example:

Query: "machine learning algorithms"
Matches: Documents containing "machine", "learning", AND "algorithms"

Limitations:

  • Misses synonyms ("ML" vs "machine learning")
  • Ignores semantics ("hot dog" vs "dog that is hot")
  • Vocabulary mismatch between query and document

2. TF-IDF (Term Frequency-Inverse Document Frequency)

Innovation: Weight terms by their importance.

Theory:

  • TF (Term Frequency): How often a term appears in a document

    TF(term, doc) = count(term in doc) / total_terms_in_doc
  • IDF (Inverse Document Frequency): How rare a term is across all documents

    IDF(term) = log(total_documents / documents_containing_term)
  • Combined Score:

    TF-IDF(term, doc) = TF(term, doc) × IDF(term)

Insight: Common words like "the" get low scores (high document frequency), while rare, informative words get high scores.
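
A minimal sketch of these definitions in Python (the toy corpus below is invented purely for illustration):

import math

def tf(term, doc):
    # Term frequency: occurrences of the term / total terms in the document
    words = doc.lower().split()
    return words.count(term) / len(words)

def idf(term, docs):
    # Inverse document frequency: log(total docs / docs containing the term)
    containing = sum(1 for d in docs if term in d.lower().split())
    return math.log(len(docs) / containing) if containing else 0.0

docs = [
    "the machine learning algorithms improve with the data",
    "the cat sat on the mat",
    "the basics of machine learning and deep learning",
]

for term in ("the", "learning"):
    score = tf(term, docs[0]) * idf(term, docs)
    print(term, round(score, 3))  # "the" scores 0.0 (appears everywhere); "learning" scores higher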

3. BM25 (Best Matching 25) - 1990s

Evolution: Improved TF-IDF with term saturation and document length normalization.

Formula:

Score(D,Q) = Σ IDF(qᵢ) × (f(qᵢ,D) × (k₁ + 1)) / (f(qᵢ,D) + k₁ × (1 - b + b × |D|/avgdl))

Where:
- D = document
- Q = query
- qᵢ = query term
- f(qᵢ,D) = term frequency in document
- |D| = document length
- avgdl = average document length
- k₁ = term frequency saturation parameter (typically 1.2)
- b = length normalization parameter (typically 0.75)

Key Improvements:

  1. Term Saturation: After a certain point, more occurrences don't increase relevance much
  2. Length Normalization: Adjust for document length (longer docs naturally have more term matches)
  3. Tunable Parameters: k₁ and b allow customization for different corpora
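
The formula translates almost directly into code. Below is a minimal single-document scorer using the default parameters above; it assumes IDF values are precomputed and is only a sketch, not a production implementation:

def bm25_score(query_terms, doc_terms, idf, avgdl, k1=1.2, b=0.75):
    """Score one document against a query with the BM25 formula above.

    idf: dict mapping term -> precomputed IDF value (assumed available).
    """
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        freq = doc_terms.count(term)  # f(qi, D)
        if freq == 0 or term not in idf:
            continue
        # Term saturation plus document length normalization
        numerator = freq * (k1 + 1)
        denominator = freq + k1 * (1 - b + b * doc_len / avgdl)
        score += idf[term] * numerator / denominator
    return score

# Toy usage with made-up IDF values
idf = {"machine": 1.2, "learning": 0.9, "algorithms": 1.5}
doc = "machine learning algorithms for machine translation".split()
print(bm25_score(["machine", "learning"], doc, idf, avgdl=10.0))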

Why BM25 Still Matters (2025): Despite being 30+ years old, BM25 remains highly effective for:

  • Exact keyword matching
  • Technical queries with specific terminology
  • Queries where word choice matters
  • Low-latency requirements (no neural network inference)

4. Vector Embeddings & Semantic Search (2010s-present)

Paradigm Shift: Represent text as dense vectors in continuous space where semantic similarity = geometric proximity.

How It Works:

  1. Text → Vector:

    "machine learning" → [0.23, -0.45, 0.67, ..., 0.12]  # 768-3072 dimensions
  2. Similarity = Cosine Similarity:

    similarity(v₁, v₂) = (v₁ · v₂) / (||v₁|| × ||v₂||)
  3. Semantic Understanding:

    "ML algorithms" ≈ "machine learning methods" ≈ "AI techniques"

    All map to similar regions in vector space!
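
A minimal sketch of the similarity computation with NumPy (the random vectors here stand in for real embeddings):

import numpy as np

def cosine_similarity(v1, v2):
    # (v1 · v2) / (||v1|| × ||v2||)
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# Stand-in vectors; in practice these come from an embedding model
v1 = np.random.rand(768)
v2 = np.random.rand(768)
print(cosine_similarity(v1, v2))  # values near 1.0 indicate semantic similarity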

Breakthrough Models:

  • Word2Vec (2013): Word-level embeddings
  • BERT (2018): Contextual embeddings
  • Sentence-BERT (2019): Optimized for semantic search
  • OpenAI Ada/text-embedding-3 (2024): 3072-dimensional embeddings

Advantages:

  • ✅ Handles synonyms and paraphrasing
  • ✅ Understands semantic similarity
  • ✅ Cross-lingual capabilities
  • ✅ Captures context

Limitations:

  • ❌ May miss exact keyword matches
  • ❌ Higher computational cost
  • ❌ "Black box" - harder to explain why results matched

Key Insight: BM25 and vector search are complementary, not competitive.

| Aspect           | BM25                            | Vector Search                     |
| ---------------- | ------------------------------- | --------------------------------- |
| Strength         | Exact keywords, technical terms | Semantic similarity, paraphrasing |
| Speed            | Very fast                       | Requires inference                |
| Explainability   | Clear (term matching)           | Opaque (neural network)           |
| Handles Synonyms | ❌ No                           | ✅ Yes                            |
| Exact Matches    | ✅ Excellent                    | ❌ May miss                       |
| Cost             | Free                            | Embedding costs                   |

Solution: Combine both approaches!


Architecture & Design

Profile-Based Architecture

Design Philosophy: Different use cases require different tradeoffs between latency, quality, and cost.

Instead of one-size-fits-all, we define profiles with specific SLOs (Service Level Objectives):

                User Need
                    ↓
             Profile Selection
            /       |       \
     Balanced    Latency    Quality
         ↓          ↓          ↓
      Different Configurations
         ↓          ↓          ↓
           Different SLOs

Three Profiles

Profile 1: Balanced (General-Purpose)

Use Case: Customer support, knowledge bases, general Q&A

SLOs:

  • Latency: < 500ms (P95)
  • Quality: Good (0.7-0.8 metrics)
  • Cost: $0.60 per 1,000 queries

Configuration:

retrieval:
  topK: 20
  alpha: 0.5                  # Equal weight BM25 and vector
  query_expansion: PRF        # Pseudo-Relevance Feedback

reranking:
  enabled: true
  model: cross-encoder/ms-marco-MiniLM-L-6-v2   # Light model

summarization:
  mode: extractive            # Fast, no LLM costs
  max_length: 250 words
  citations: inline

Why This Works:

  • Hybrid search catches both exact matches and semantic similarity
  • Light reranking improves top results without major latency hit
  • Extractive summarization is fast and always faithful to source

Profile 2: Latency-First (Interactive)

Use Case: Auto-complete, real-time chat, instant search

SLOs:

  • Latency: < 250ms (P95)
  • Quality: Acceptable (0.7 metrics)
  • Cost: $0.35 per 1,000 queries

Configuration:

retrieval:
  topK: 10                    # Fewer results for speed
  alpha: 0.7                  # Favor BM25 (faster than vector)
  query_expansion: disabled   # Skip for speed

reranking:
  enabled: false              # Skip for speed

summarization:
  mode: extractive
  max_length: 150 words       # Shorter
  citations: inline

Optimization Strategies:

  • Favor BM25 (no neural network inference)
  • Reduce topK (less processing)
  • Skip query expansion and reranking
  • Cache aggressively

Profile 3: Quality-First (Research)

Use Case: Compliance, legal research, scientific literature review

SLOs:

  • Latency: < 5000ms (P95)
  • Quality: Excellent (0.85-0.95 metrics)
  • Cost: $52.50 per 1,000 queries

Configuration:

retrieval:
  topK: 50                    # Cast wider net
  alpha: 0.5
  query_expansion: HyDE       # Hypothetical Document Embeddings

reranking:
  enabled: true
  model: cross-encoder/ms-marco-MiniLM-L-12-v2   # Larger model
  two_stage: true             # First cross-encoder, then LLM

summarization:
  mode: abstractive           # LLM-based
  model: gpt-4
  max_length: 500 words
  citations: inline
  faithfulness_threshold: 0.95   # Strict

Quality Enhancements:

  • HyDE query expansion finds conceptually similar docs
  • Two-stage reranking for maximum precision
  • GPT-4 abstractive summarization for fluency
  • Strict faithfulness requirements

System Architecture

┌───────────────────────────────────────────────────────────┐
│                    Client Application                      │
└─────────────────────────────┬─────────────────────────────┘
                              │
                              ↓
┌───────────────────────────────────────────────────────────┐
│                  Profile Selection Layer                    │
│         (balanced | latency_first | quality_first)         │
└─────────────────────────────┬─────────────────────────────┘
                              │
                              ↓
┌───────────────────────────────────────────────────────────┐
│                 Document Search Pipeline                    │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐  │
│  │  Retriever   │ →  │   Reranker   │ →  │  Summarizer  │  │
│  │   (Hybrid)   │    │  (Optional)  │    │  (Grounded)  │  │
│  └──────────────┘    └──────────────┘    └──────────────┘  │
└─────────────────────────────┬─────────────────────────────┘
                              │
                              ↓
┌───────────────────────────────────────────────────────────┐
│                 Storage & Indexing Layer                    │
│  ┌────────────┐    ┌────────────┐    ┌────────────┐        │
│  │ OpenSearch │    │  MongoDB   │    │   Redis    │        │
│  │ (BM25+kNN) │    │ (Metadata) │    │  (Cache)   │        │
│  └────────────┘    └────────────┘    └────────────┘        │
└───────────────────────────────────────────────────────────┘

Key Design Principles:

  1. Composability: Each component (retriever, reranker, summarizer) is independent
  2. Configurability: Profiles change configuration, not code
  3. Observability: Each stage reports timing and metrics
  4. Fail-Closed: If quality checks fail, fall back to safer option
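
One way to express the composability principle is with minimal stage interfaces that profiles can mix and match. The Protocol names below are illustrative, not the actual package API:

from typing import List, Protocol

class Retriever(Protocol):
    def retrieve(self, query: str, topK: int) -> List["SearchResult"]: ...

class Reranker(Protocol):
    def rerank(self, query: str, results: List["SearchResult"]) -> List["SearchResult"]: ...

class Summarizer(Protocol):
    def summarize(self, query: str, contexts: List["SearchResult"]) -> "GroundedSummary": ...

# A pipeline depends only on these interfaces, so a profile can swap or
# disable a stage (e.g. skip reranking) through configuration, not code.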

Information Retrieval Deep Dive

Hybrid Search Implementation

Core Algorithm: Reciprocal Rank Fusion (RRF)

Problem: How do we combine scores from BM25 and vector search?

Naive Approaches (Don't Work):

# ❌ Simple averaging - different score scales!
combined = (bm25_score + vector_score) / 2

# ❌ Normalization - loses relative differences
combined = normalize(bm25_score) + normalize(vector_score)

RRF Solution (Works!):

def reciprocal_rank_fusion(bm25_results, vector_results, k=60):
    """
    RRF: Combine rankings, not scores.

    Insight: Rank position is more meaningful than raw score.
    """
    scores = {}

    # BM25 contribution
    for rank, doc_id in enumerate(bm25_results, 1):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)

    # Vector contribution
    for rank, doc_id in enumerate(vector_results, 1):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)

    # Sort by combined score
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

Why This Works:

  1. Rank-based: Position matters more than score magnitude
  2. Robust: Works even when score distributions differ
  3. Tunable: Parameter k controls emphasis (typically 60)
  4. Additive: Documents appearing in both lists get higher scores

α-weighting Extension:

def weighted_rrf(bm25_results, vector_results, alpha=0.5, k=60):
    """
    Add configurable weighting between BM25 and vector.

    alpha = 0: Pure BM25
    alpha = 0.5: Equal weight
    alpha = 1: Pure vector
    """
    scores = {}

    for rank, doc_id in enumerate(bm25_results, 1):
        scores[doc_id] = scores.get(doc_id, 0) + (1 - alpha) / (k + rank)

    for rank, doc_id in enumerate(vector_results, 1):
        scores[doc_id] = scores.get(doc_id, 0) + alpha / (k + rank)

    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

Query Expansion Techniques

Motivation: User queries are often short and miss key terms.

Technique 1: Pseudo-Relevance Feedback (PRF)

Idea: Use top results to expand the query.

Algorithm:

1. Execute initial search with original query
2. Extract terms from top-N results (typically N=3-5)
3. Identify high-frequency terms not in original query
4. Add these terms to create expanded query
5. Re-search with expanded query

Example:

Original query: "password reset"

Top result terms: password, reset, account, security, email, verification, link

Expanded query: "password reset account security email verification"

When It Works:

  • Top results are highly relevant
  • Query is under-specified
  • Domain has consistent terminology

When It Fails:

  • Top results are off-topic (bad initial query)
  • Over-expansion (too many terms)
  • Query drift (meaning changes)
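
A compact sketch of PRF-style expansion (assuming a `search` callable that returns result texts ranked best-first; real systems typically weight candidate terms by TF-IDF rather than raw counts):

from collections import Counter

def expand_query_prf(query, search, top_n=3, num_terms=5):
    """Pseudo-Relevance Feedback sketch: expand the query with frequent
    terms from the top-N initial results. `search(query)` is assumed to
    return a list of result texts, best first."""
    initial_results = search(query)[:top_n]

    query_terms = set(query.lower().split())
    counts = Counter(
        word
        for text in initial_results
        for word in text.lower().split()
        if word not in query_terms and len(word) > 3  # skip original and very short terms
    )

    expansion = [term for term, _ in counts.most_common(num_terms)]
    return query + " " + " ".join(expansion)

# Hypothetical usage:
# expanded = expand_query_prf("password reset", search=my_search_fn)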

Technique 2: HyDE (Hypothetical Document Embeddings)

Idea: Generate what an ideal document would look like, then search for similar documents.

Algorithm:

1. Prompt LLM: "Write a passage that answers: {query}"
2. LLM generates hypothetical document
3. Embed hypothetical document
4. Search for documents similar to this embedding

Example:

Query: "How does OAuth2 work?"

HyDE Generation (LLM):
"OAuth2 is an authorization framework that enables applications to obtain
limited access to user accounts. It works by redirecting users to a service
provider where they authenticate and authorize access. The client receives
an access token that can be used to make API requests on behalf of the user.
Key components include the authorization server, resource server, client,
and resource owner. The flow involves authorization codes, access tokens,
and refresh tokens..."

Search: Documents similar to this detailed explanation

Why It Works:

  • Bridges vocabulary gap (query language → document language)
  • Captures intent implicitly
  • Works well for conceptual queries

Cost Consideration:

  • Requires LLM call per query (~$0.0001-0.001)
  • Adds 200-500ms latency
  • Use selectively (quality-first profile)
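
A sketch of HyDE using an OpenAI-style client; the model names are placeholders, and the same pattern applies to whichever LLM and embedding provider the deployment uses:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def hyde_embedding(query, llm_model="gpt-4o-mini", embed_model="text-embedding-3-large"):
    """HyDE sketch: generate a hypothetical answer, then embed it.
    The resulting vector replaces the raw query embedding for the
    vector-search leg of hybrid retrieval."""
    # Steps 1-2: generate a hypothetical document that answers the query
    response = client.chat.completions.create(
        model=llm_model,
        messages=[{"role": "user",
                   "content": f"Write a short passage that answers: {query}"}],
    )
    hypothetical_doc = response.choices[0].message.content

    # Step 3: embed the hypothetical document
    embedding = client.embeddings.create(model=embed_model, input=hypothetical_doc)
    return embedding.data[0].embedding  # step 4: use this vector for kNN search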

Reranking Deep Dive

Problem: Initial retrieval (BM25 + vector) optimizes for recall. We need precision.

Solution: Cross-encoder reranking

How It Works

Bi-Encoder (Retrieval):

Query    → Encoder → [embedding] ─┐
                                  ├─ Cosine Similarity
Document → Encoder → [embedding] ─┘

Fast but approximate (query and doc encoded separately)

Cross-Encoder (Reranking):

[Query + Document] → Transformer → Relevance Score

Accurate but slow (query and doc processed together)

Two-Stage Architecture:

Stage 1 (Retrieval):
Bi-encoder searches 10M documents → Top 20-50 candidates
(Fast: ~100-200ms)

Stage 2 (Reranking):
Cross-encoder scores 20-50 candidates → Top 5-10
(Slower: ~50-100ms)

Cross-Encoder Models

| Model                   | Parameters | Latency | Use Case       |
| ----------------------- | ---------- | ------- | -------------- |
| ms-marco-MiniLM-L-6-v2  | 22M        | ~50ms   | Balanced       |
| ms-marco-MiniLM-L-12-v2 | 33M        | ~100ms  | Quality-first  |
| ms-marco-electra-base   | 110M       | ~200ms  | Research-grade |

Practical Tip: Reranking improves nDCG (ranking quality) by 15-30% but adds latency. Profile-based approach lets you choose the tradeoff.
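
For reference, the reranking stage can be sketched with the sentence-transformers CrossEncoder (assuming each candidate object exposes a `.content` attribute):

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidates, top_n=10):
    """Stage 2: score each (query, document) pair jointly and keep the best."""
    pairs = [(query, c.content) for c in candidates]
    scores = reranker.predict(pairs)  # one relevance score per pair
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [candidate for candidate, _ in ranked[:top_n]]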


Summarization Techniques

Extractive vs. Abstractive

Fundamental Distinction: How is the summary created?

Extractive Summarization

Definition: Select important sentences from source documents.

Process:

Source: [S1, S2, S3, S4, S5]
    ↓ Score each sentence
Scores: [0.9, 0.3, 0.8, 0.5, 0.7]
    ↓ Select top N
Summary: [S1, S3, S5]

Advantages:

  • ✅ Always faithful (sentences are from source)
  • ✅ Fast (no generation)
  • ✅ Free (no LLM API costs)
  • ✅ Easy to cite (sentence → source mapping)

Disadvantages:

  • ❌ Can be choppy (disconnected sentences)
  • ❌ Limited fluency
  • ❌ Redundancy not removed
  • ❌ Can't synthesize across documents

Abstractive Summarization

Definition: Generate new text that captures key information.

Process:

Source: [Documents]
    ↓ LLM reads and understands
Comprehension: [Key concepts, relationships]
    ↓ LLM generates new text
Summary: Fluent, synthetic text

Advantages:

  • ✅ Natural, fluent language
  • ✅ Can synthesize across documents
  • ✅ Removes redundancy
  • ✅ Customizable style and length

Disadvantages:

  • ❌ Hallucination risk
  • ❌ Expensive (LLM API costs)
  • ❌ Slower (generation time)
  • ❌ Citation tracking harder

Extractive: TextRank Algorithm

Core Idea: Text is a graph. Important sentences are well-connected.

Algorithm:

1. Build Graph:
   - Nodes = sentences
   - Edges = similarity between sentences

2. Calculate PageRank:
   - Iteratively propagate importance
   - Well-connected nodes get high scores

3. Select Top-K:
   - Sort by score
   - Return top N sentences

Mathematical Foundation:

PageRank Formula:

PR(Sᵢ) = (1 - d) + d × Σ_{Sⱼ→Sᵢ} PR(Sⱼ) / |Out(Sⱼ)|

Where:
- PR(Sᵢ) = PageRank score of sentence i
- d = damping factor (typically 0.85)
- Sⱼ→Sᵢ = sentences Sⱼ that link to sentence Sᵢ
- |Out(Sⱼ)| = number of outgoing edges from sentence j
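
A minimal power-iteration sketch of this scoring, using a pairwise similarity function (such as the word-overlap version below) to supply edge weights:

def textrank_scores(sentences, similarity, d=0.85, iterations=30):
    """Power iteration over the sentence graph; `similarity(s1, s2)` supplies
    edge weights (for example, the word-overlap function shown below)."""
    n = len(sentences)
    weights = [[similarity(s1, s2) if i != j else 0.0
                for j, s2 in enumerate(sentences)]
               for i, s1 in enumerate(sentences)]
    out_sums = [sum(row) or 1.0 for row in weights]  # guard against isolated sentences

    scores = [1.0] * n
    for _ in range(iterations):
        scores = [(1 - d) + d * sum(weights[j][i] * scores[j] / out_sums[j]
                                    for j in range(n))
                  for i in range(n)]
    return scores  # higher score = more central sentence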

Similarity Metric (determines edges):

def sentence_similarity(s1, s2):
    """Jaccard similarity based on word overlap."""
    words1 = set(s1.lower().split())
    words2 = set(s2.lower().split())

    intersection = words1 & words2
    union = words1 | words2

    return len(intersection) / len(union) if union else 0

Why It Works:

  • Captures centrality (sentences that connect to many others)
  • Democratic (all sentences contribute to scoring)
  • Unsupervised (no training needed)

Implementation:

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words

# Initialize
stemmer = Stemmer("english")
summarizer = TextRankSummarizer(stemmer)
summarizer.stop_words = get_stop_words("english")

# Parse document
parser = PlaintextParser.from_string(text, Tokenizer("english"))

# Generate summary (5 sentences)
summary = summarizer(parser.document, 5)

Abstractive: LLM-Based with Grounding

Challenge: LLMs hallucinate. How do we ensure faithfulness?

Solution: Grounded Generation

Grounded Summarization Process

1. Context Preparation:
   [Source 1]: Document content...
   [Source 2]: Document content...
   [Source 3]: Document content...

2. Prompt Engineering:
   "Based ONLY on the sources above, answer: {query}
    - Cite sources using [1], [2], etc.
    - If information not in sources, say so
    - Do not add external knowledge"

3. LLM Generation:
   Summary with inline citations

4. Faithfulness Verification:
   Check that claims are supported by sources

5. Fail-Closed:
   If faithfulness < threshold, use extractive fallback
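
Putting the five steps together, a fail-closed flow might look like the sketch below; `llm`, `verify_faithfulness` (next subsection), and `extractive_fallback` are placeholders for the components described in this guide:

def grounded_summarize(query, sources, llm, threshold=0.85):
    """Grounded generation with a fail-closed fallback (sketch)."""
    # Steps 1-2: sources-only context and prompt
    context = "\n\n".join(f"[Source {i + 1}]: {s}" for i, s in enumerate(sources))
    prompt = (
        f"{context}\n\n"
        f"Based ONLY on the sources above, answer: {query}\n"
        "- Cite sources using [1], [2], etc.\n"
        "- If the information is not in the sources, say so.\n"
        "- Do not add external knowledge."
    )

    # Step 3: generate (llm is any callable that takes a prompt and returns text)
    summary = llm(prompt)

    # Step 4: verify against the sources
    faithfulness = verify_faithfulness(summary, sources)

    # Step 5: fail closed if the summary is not sufficiently grounded
    if faithfulness < threshold:
        return extractive_fallback(query, sources), faithfulness
    return summary, faithfulness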

Faithfulness Verification

Simple Approach (Word Overlap):

def verify_faithfulness(summary, sources):
    """
    Check if summary claims are grounded in sources.
    """
    summary_sentences = split_sentences(summary)
    context_words = set(word for source in sources for word in source.lower().split())

    faithful_count = 0
    for sentence in summary_sentences:
        sentence_words = set(sentence.lower().split())
        overlap = len(sentence_words & context_words)

        if overlap / len(sentence_words) >= 0.5:  # 50% threshold
            faithful_count += 1

    return faithful_count / len(summary_sentences)

Advanced Approach (NLI Model):

from transformers import pipeline

nli_model = pipeline("text-classification",
                     model="microsoft/deberta-large-mnli")

def verify_with_nli(summary_claim, source_text):
    """
    Use Natural Language Inference to verify claim.

    Returns: "entailment" | "neutral" | "contradiction"
    """
    result = nli_model(f"{source_text} [SEP] {summary_claim}")
    return result[0]["label"].lower()  # normalize label casing across model versions

# Usage
for claim in summary_claims:
    for source in sources:
        if verify_with_nli(claim, source) == "entailment":
            # Claim supported by this source
            break
    else:
        # Claim not supported by any source - potential hallucination
        flag_claim(claim)

Citation Management

Goal: Track provenance of every claim.

Data Structure:

@dataclass
class Citation:
    citation_id: int
    document_id: str
    document_title: str
    chunk_id: str
    snippet: str  # 100-150 chars showing where claim came from

@dataclass
class GroundedSummary:
    text: str  # "Revenue increased 25% [1]. Market share grew [2]."
    citations: Dict[int, Citation]
    faithfulness: float
    coverage: float

Citation Styles:

  1. Inline (Default):

    The company achieved record revenue in Q3 [1]. This was driven by 
    strong performance in the EMEA region [2] and new product launches [3].
  2. Footnote:

    The company achieved record revenue in Q3¹. This was driven by strong 
    performance in the EMEA region² and new product launches³.

    ---
    1. Q3 Financial Report, Page 5
    2. Regional Analysis Summary, Page 12
    3. Product Launch Timeline, Page 8
  3. Harvard:

    The company achieved record revenue in Q3 (Financial Report 2024). 
    This was driven by strong performance in the EMEA region (Regional
    Analysis 2024) and new product launches (Product Timeline 2024).

Implementation:

def add_citations(summary_sentences, sources):
    """Add citation markers to summary."""
    cited_summary = []
    citations = {}
    citation_id = 1

    for sentence in summary_sentences:
        # Find which source this sentence came from
        source_idx = find_source(sentence, sources)

        if source_idx is not None:
            # Add citation
            cited_summary.append(f"{sentence} [{citation_id}]")
            citations[citation_id] = Citation(
                citation_id=citation_id,
                document_id=sources[source_idx].doc_id,
                document_title=sources[source_idx].title,
                chunk_id=sources[source_idx].chunk_id,
                snippet=sentence[:150]
            )
            citation_id += 1
        else:
            # No source found - potential issue
            log_warning(f"Uncited sentence: {sentence}")
            cited_summary.append(sentence)

    return " ".join(cited_summary), citations

Profile-Based Configuration

Configuration as Code

Design Pattern: Profiles are first-class objects, not just parameters.

@dataclass
class PipelineConfig:
    """Complete pipeline configuration."""
    profile: ProfileType
    topK: int
    retrieval_alpha: float
    enable_reranking: bool
    enable_query_expansion: bool
    query_expansion_method: Optional[str]
    summarization_mode: SummarizationMode
    max_summary_length: int
    latency_budget_ms: int

    # SLO thresholds
    context_precision_threshold: float
    context_recall_threshold: float
    faithfulness_threshold: float
    response_relevancy_threshold: float

Factory Pattern

Usage:

# Simple profile creation
pipeline = DocumentSearchPipeline.create_profile(
    profile=ProfileType.BALANCED,
    store=opensearch_store,
    embedding_function=get_embeddings
)

# Execute with profile-specific behavior
result = pipeline.execute("How do I reset my password?")

Implementation:

@classmethod
def create_profile(cls, profile: ProfileType, store, embedding_function):
    """Factory method for profile-based pipelines."""

    if profile == ProfileType.BALANCED:
        return cls._create_balanced_profile(store, embedding_function)
    elif profile == ProfileType.LATENCY_FIRST:
        return cls._create_latency_first_profile(store, embedding_function)
    elif profile == ProfileType.QUALITY_FIRST:
        return cls._create_quality_first_profile(store, embedding_function)

@classmethod
def _create_balanced_profile(cls, store, embedding_function):
    """Configure balanced profile."""
    retriever = HybridDocumentRetriever(
        store=store,
        embedding_function=embedding_function,
        alpha=0.5,
        query_expander=QueryExpander(method="prf")
    )

    reranker = CrossEncoderReranker(
        model="cross-encoder/ms-marco-MiniLM-L-6-v2"
    )

    summarizer = GroundedSummarizer(
        mode=SummarizationMode.EXTRACTIVE,
        faithfulness_threshold=0.85
    )

    config = PipelineConfig(
        profile=ProfileType.BALANCED,
        topK=20,
        retrieval_alpha=0.5,
        enable_reranking=True,
        enable_query_expansion=True,
        query_expansion_method="prf",
        summarization_mode=SummarizationMode.EXTRACTIVE,
        max_summary_length=250,
        latency_budget_ms=500,
        context_precision_threshold=0.70,
        context_recall_threshold=0.70,
        faithfulness_threshold=0.85,
        response_relevancy_threshold=0.75
    )

    return cls(retriever, summarizer, config, reranker)

SLO Enforcement

Monitoring:

def execute(self, query, filters=None):
    """Execute with SLO tracking."""
    start_time = time.time()
    timing = {}

    # Retrieval
    retrieval_start = time.time()
    results = self.retriever.retrieve(query, topK=self.config.topK)
    timing["retrieval_ms"] = (time.time() - retrieval_start) * 1000

    # Check if we're approaching budget
    if timing["retrieval_ms"] > self.config.latency_budget_ms * 0.5:
        log_warning("Retrieval taking too long, may miss SLO")

    # Reranking (optional)
    if self.config.enable_reranking:
        reranking_start = time.time()
        results = self.reranker.rerank(query, results)
        timing["reranking_ms"] = (time.time() - reranking_start) * 1000

    # Summarization
    summarization_start = time.time()
    summary = self.summarizer.summarize(query, results[:5])
    timing["summarization_ms"] = (time.time() - summarization_start) * 1000

    # Total time
    timing["total_ms"] = (time.time() - start_time) * 1000

    # SLO check
    slo_met = timing["total_ms"] <= self.config.latency_budget_ms

    if not slo_met:
        log_warning(f"SLO missed: {timing['total_ms']:.2f}ms > {self.config.latency_budget_ms}ms")
        emit_metric("slo_violation", {"profile": self.config.profile.value})

    return PipelineResult(
        query=query,
        results=results,
        summary=summary,
        timing=timing,
        slo_met=slo_met,
        profile=self.config.profile.value
    )

Implementation Guide

Quick Start

See Complete Working Example:

Basic Usage (3 steps):

from packages.rag.document_search import (
    DocumentSearchPipeline, ProfileType, OpenSearchDocumentStore
)

# 1. Initialize with balanced profile
store = OpenSearchDocumentStore("localhost", 9200, "documents")
pipeline = DocumentSearchPipeline.create_profile(
    ProfileType.BALANCED, store, embedding_function
)

# 2. Execute search
result = pipeline.execute("What were the key findings?")

# 3. Use results
print(result.summary.text)  # Grounded summary with citations

Key Implementation Files:

  • packages/rag/document_search/pipeline.py - Main pipeline with profiles
  • packages/rag/document_search/store.py - OpenSearch integration
  • packages/rag/document_search/retriever.py - Hybrid retrieval
  • packages/rag/document_search/summarizer.py - Grounded summarization

For detailed code examples, see the implementation files and demo script.


Evaluation & Quality Assurance

RAGAS Metrics

RAGAS (Retrieval Augmented Generation Assessment) provides standardized metrics for RAG systems.

Metric 1: Context Precision

Question: Are the retrieved documents relevant to the query?

Formula:

Context Precision = (Relevant Retrieved) / (Total Retrieved)

Example:

Query: "How to reset password?"
Retrieved: 10 documents
Relevant: 7 documents (password reset related)

Context Precision = 7/10 = 0.70

Target: > 0.70 (balanced), > 0.85 (quality-first)

Metric 2: Context Recall

Question: Did we retrieve all relevant documents?

Formula:

Context Recall = (Relevant Retrieved) / (Total Relevant in Corpus)

Example:

Query: "How to reset password?"
Total relevant in corpus: 10 documents
Retrieved: 7 documents

Context Recall = 7/10 = 0.70

Target: > 0.70 (balanced), > 0.85 (quality-first)
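
Both retrieval metrics reduce to simple set arithmetic once relevance labels exist; a toy sketch with hand-labeled document IDs:

def context_precision(retrieved_ids, relevant_ids):
    # Relevant retrieved / total retrieved
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in retrieved_ids if doc_id in relevant)
    return hits / len(retrieved_ids) if retrieved_ids else 0.0

def context_recall(retrieved_ids, relevant_ids):
    # Relevant retrieved / total relevant in corpus
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in retrieved_ids if doc_id in relevant)
    return hits / len(relevant) if relevant else 0.0

retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d7"]
print(context_precision(retrieved, relevant))  # 2/4 = 0.50
print(context_recall(retrieved, relevant))     # 2/3 ≈ 0.67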

Metric 3: Faithfulness

Question: Is the answer grounded in the retrieved context?

Formula:

Faithfulness = (Claims Supported by Context) / (Total Claims in Answer)

Example:

Answer: "Password can be reset via email [✓] or phone [✗]"
Context mentions: email reset
Context does NOT mention: phone reset

Faithfulness = 1/2 = 0.50 (FAIL)

Target: > 0.85 (balanced), > 0.95 (quality-first)

Metric 4: Answer Relevancy

Question: Does the answer actually address the query?

Formula (using cosine similarity):

Answer Relevancy = cosine_similarity(embed(query), embed(answer))

Target: > 0.75 (balanced), > 0.80 (quality-first)

Evaluation Pipeline

from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_relevancy
)

def evaluate_pipeline(pipeline, test_cases):
    """Evaluate pipeline on test dataset."""

    results = []
    for test in test_cases:
        # Execute pipeline
        result = pipeline.execute(test.query, filters=test.filters)

        # Prepare data for RAGAS
        results.append({
            "question": test.query,
            "contexts": [r.content for r in result.results],
            "answer": result.summary.text,
            "ground_truth": test.ground_truth
        })

    # Calculate metrics
    evaluation = evaluate(
        dataset=results,
        metrics=[
            context_precision,
            context_recall,
            faithfulness,
            answer_relevancy
        ]
    )

    return evaluation

CI/CD Gates

Automated Quality Gates:

# .github/workflows/evaluation.yml
name: Quality Gates

on: [pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - name: Run evaluation
        run: python scripts/run_evaluation.py

      - name: Check metrics
        run: |
          # Fail if metrics fall below thresholds (jq evaluates the float comparison)
          if [ "$(jq '.context_precision < 0.70' metrics.json)" = "true" ]; then
            echo "Context precision below threshold!"
            exit 1
          fi

          if [ "$(jq '.faithfulness < 0.85' metrics.json)" = "true" ]; then
            echo "Faithfulness below threshold!"
            exit 1
          fi

      - name: Check latency
        run: |
          # Fail if P95 latency exceeds budget
          if [ "$(jq '.latency_p95 > 500' metrics.json)" = "true" ]; then
            echo "Latency SLO violated!"
            exit 1
          fi

Production Best Practices

1. Caching Strategy

Multi-Level Caching:

class CachedPipeline:
    """Pipeline with aggressive caching."""

    def __init__(self, pipeline, redis_client):
        self.pipeline = pipeline
        self.redis = redis_client

    def execute(self, query, filters=None):
        # L1: Query cache (full result)
        cache_key_1 = f"query:{hash(query)}:{hash(str(filters))}"
        cached = self.redis.get(cache_key_1)
        if cached:
            return json.loads(cached)

        # L2: Retrieval cache
        cache_key_2 = f"retrieval:{hash(query)}"
        cached_results = self.redis.get(cache_key_2)

        if cached_results:
            results = json.loads(cached_results)
        else:
            results = self.pipeline.retriever.retrieve(query, filters)
            self.redis.setex(cache_key_2, 3600, json.dumps(results))  # 1hr TTL

        # L3: Summary cache
        cache_key_3 = f"summary:{hash(query)}:{hash(str([r.chunk_id for r in results[:5]]))}"
        cached_summary = self.redis.get(cache_key_3)

        if cached_summary:
            summary = json.loads(cached_summary)
        else:
            summary = self.pipeline.summarizer.summarize(query, results[:5])
            self.redis.setex(cache_key_3, 86400, json.dumps(summary))  # 24hr TTL

        # Construct result
        result = PipelineResult(...)

        # Cache full result
        self.redis.setex(cache_key_1, 3600, json.dumps(result))

        return result

Cache Invalidation:

def invalidate_document_cache(document_id):
    """Invalidate all caches related to a document."""
    # When document is updated/deleted
    pattern = f"*:{document_id}:*"
    keys = redis.keys(pattern)
    if keys:
        redis.delete(*keys)

2. Error Handling

Graceful Degradation:

def execute_with_fallback(self, query, filters=None):
    """Execute with multiple fallback levels."""

    try:
        # Try quality-first profile
        return self.quality_pipeline.execute(query, filters)

    except TimeoutError:
        log_warning("Quality pipeline timeout, falling back to balanced")
        try:
            # Fall back to balanced profile
            return self.balanced_pipeline.execute(query, filters)

        except Exception as e:
            log_error(f"Balanced pipeline failed: {e}")
            try:
                # Fall back to latency-first (minimal features)
                return self.latency_pipeline.execute(query, filters)

            except Exception as e:
                log_error(f"All pipelines failed: {e}")
                # Return error response
                return ErrorResponse(
                    error="Service temporarily unavailable",
                    retry_after=60
                )

3. Monitoring

Key Metrics to Track:

# Latency
emit_metric("search.latency_ms", timing["total_ms"], {
    "profile": config.profile,
    "slo_met": slo_met
})

# Quality
emit_metric("search.faithfulness", summary.faithfulness, {
    "profile": config.profile,
    "mode": summary.mode
})

# Usage
emit_metric("search.queries_total", 1, {
    "profile": config.profile,
    "cache_hit": cache_hit
})

# Cost
emit_metric("search.cost_usd", cost, {
    "profile": config.profile,
    "llm_calls": llm_calls
})

Dashboards:

┌──────────────────────────────────────────────────────┐
│              Document Search Dashboard                │
├──────────────────────────────────────────────────────┤
│ Latency (P95)                                         │
│   Balanced:        482ms  [==============]  96.4%     │
│   Latency-First:   198ms  [=========]       99.1%     │
│   Quality-First:  4521ms  [==============]  90.4%     │
│                                                       │
│ Quality Metrics (Balanced)                            │
│   Context Precision:  0.73  [====]   ✓                │
│   Faithfulness:       0.89  [=====]  ✓                │
│   Relevancy:          0.78  [====]   ✓                │
│                                                       │
│ Cost (Last 24h)                                       │
│   Balanced:       $12.50  (20.8K queries)             │
│   Latency-First:   $7.20  (20.6K queries)             │
│   Quality-First: $1,050   (20 queries)                │
└──────────────────────────────────────────────────────┘

4. A/B Testing

Comparative Evaluation:

import numpy as np
import scipy.stats as stats

def ab_test_profiles(test_queries, pipeline_a, pipeline_b):
    """Compare two profiles on the same queries."""

    results_a = []
    results_b = []

    for query in test_queries:
        # Profile A
        result_a = pipeline_a.execute(query)
        results_a.append({
            "latency_ms": result_a.timing["total_ms"],
            "faithfulness": result_a.summary.faithfulness,
            "num_results": len(result_a.results)
        })

        # Profile B
        result_b = pipeline_b.execute(query)
        results_b.append({
            "latency_ms": result_b.timing["total_ms"],
            "faithfulness": result_b.summary.faithfulness,
            "num_results": len(result_b.results)
        })

    # Statistical comparison
    latency_p_value = stats.ttest_ind(
        [r["latency_ms"] for r in results_a],
        [r["latency_ms"] for r in results_b]
    ).pvalue

    print(f"Latency difference significant: {latency_p_value < 0.05}")
    print(f"Profile A avg latency: {np.mean([r['latency_ms'] for r in results_a]):.2f}ms")
    print(f"Profile B avg latency: {np.mean([r['latency_ms'] for r in results_b]):.2f}ms")

Advanced Topics

1. Multi-Document Summarization

Challenge: Synthesize information across multiple documents.

Approach:

class MultiDocumentSummarizer:
    """Summarize multiple documents together."""

    def summarize_multiple(self, query, document_ids):
        """
        Generate comparative/synthesis summary.

        Techniques:
        1. Cluster similar information
        2. Identify common themes
        3. Highlight differences
        4. Synthesize with LLM
        """

        # Retrieve all documents
        all_chunks = []
        for doc_id in document_ids:
            chunks = self.retrieve_document_chunks(doc_id)
            all_chunks.extend(chunks)

        # Cluster by theme
        themes = self.cluster_by_theme(all_chunks)

        # Build comparative prompt
        prompt = f"""Compare and synthesize the following documents regarding: {query}

Documents:
{self.format_documents_with_sources(all_chunks)}

Provide:
1. Common themes across documents
2. Key differences or conflicts
3. Synthesized conclusion

Cite sources using [Doc1], [Doc2], etc.
"""

        # Generate synthesis
        response = self.llm.invoke(prompt)

        return MultiDocSummary(
            text=response.content,
            num_documents=len(document_ids),
            themes=themes,
            citations=self.extract_citations(response.content)
        )

2. Streaming Summarization

Use Case: Real-time summary generation for long documents.

def stream_summary(self, query, contexts):
    """Stream summary tokens as they're generated."""

    prompt = self._build_prompt(query, contexts)

    for chunk in self.llm.stream(prompt):
        # Yield tokens in real-time
        yield chunk.content

        # Optionally: Run citation extraction incrementally
        if chunk.content.endswith(('.', '!', '?')):
            sentence = self.buffer + chunk.content
            citations = self.extract_citations(sentence)
            yield {"citations": citations}
            self.buffer = ""
        else:
            self.buffer += chunk.content

3. Hierarchical Summarization

Use Case: Very long documents (> 100K words).

Strategy: Pyramid summarization.

Level 0 (Original): [C1, C2, C3, ..., C100]  (100 chunks)
        ↓ Summarize in groups of 10
Level 1: [S1, S2, S3, ..., S10]  (10 summaries)
        ↓ Summarize again
Level 2: [S_final]  (1 summary)

Implementation:

def hierarchical_summary(self, document, chunk_size=500, group_size=10):
    """Multi-level summarization for very long documents."""

    # Level 0: Chunk document
    chunks = self.chunker.chunk(document, chunk_size)

    current_level = chunks
    level = 0

    while len(current_level) > 1:
        level += 1
        next_level = []

        # Group and summarize
        for i in range(0, len(current_level), group_size):
            group = current_level[i:i + group_size]
            group_text = "\n\n".join([c.content for c in group])

            summary = self.summarizer.summarize(
                query="Summarize the key points",
                contexts=group,
                max_length=chunk_size
            )

            next_level.append(summary)

        current_level = next_level

    return current_level[0]  # Final summary

4. Domain-Specific Customization

Legal Documents:

legal_config = PipelineConfig(
    profile=ProfileType.QUALITY_FIRST,
    summarization_mode=SummarizationMode.EXTRACTIVE,  # Prefer exact text
    faithfulness_threshold=0.98,                      # Very strict
    max_summary_length=1000,                          # Longer, more detailed
    citation_style="bluebook"                         # Legal citation format
)

Medical Literature:

medical_config = PipelineConfig(
    profile=ProfileType.QUALITY_FIRST,
    query_expansion_method="mesh",  # Use MeSH terms
    summarization_prompt_template="""
    Based on the medical literature provided, summarize:
    - Study design and methodology
    - Key findings and p-values
    - Clinical implications
    - Limitations

    Cite using: [Author Year, Journal]
    """,
    domain_specific_validators=[
        validate_medical_terminology,
        check_dosage_accuracy,
        verify_clinical_recommendations
    ]
)

Real-World Example: Knowledge Base Assistant

Scenario: Customer support team needs to quickly find and summarize KB articles.

Requirements:

  • < 500ms response time
  • High accuracy (80%+ relevance)
  • Clear citations for verification
  • Handle 1000+ concurrent users

Implementation:

# 1. Initialize system
store = OpenSearchDocumentStore(
    host="opensearch.prod.company.com",
    index_name="kb_articles",
    embedding_dim=3072
)

pipeline = DocumentSearchPipeline.create_profile(
    profile=ProfileType.BALANCED,
    store=store,
    embedding_function=openai_embeddings
)

# 2. Index KB articles
for article in kb_articles:
    doc = {
        "doc_id": article.id,
        "content": article.content,
        "metadata": {
            "title": article.title,
            "category": article.category,
            "last_updated": article.updated_at,
            "author": article.author
        }
    }

    chunks = chunker.chunk(doc["content"], doc["metadata"])
    store.index_document(doc["doc_id"], chunks)

# 3. Handle support query
@app.post("/support/answer")
async def answer_support_query(request: SupportQuery):
    """Answer support query with KB articles."""

    # Execute search with profile
    result = pipeline.execute(
        query=request.question,
        filters={"category": request.category} if request.category else None
    )

    # Check SLO
    if not result.slo_met:
        log_warning(f"SLO missed for query: {request.question}")

    # Format response
    return {
        "answer": result.summary.text,
        "confidence": result.summary.faithfulness,
        "sources": [
            {
                "title": citation.document_title,
                "url": f"/kb/{citation.document_id}",
                "snippet": citation.snippet
            }
            for citation in result.summary.citations.values()
        ],
        "latency_ms": result.timing["total_ms"]
    }

# 4. Monitor and optimize
@app.get("/support/metrics")
async def get_metrics():
    """Get system metrics."""
    return {
        "queries_24h": get_query_count_24h(),
        "avg_latency_ms": get_avg_latency(),
        "p95_latency_ms": get_p95_latency(),
        "avg_faithfulness": get_avg_faithfulness(),
        "slo_compliance_rate": get_slo_compliance_rate(),
        "cache_hit_rate": get_cache_hit_rate()
    }

Results (After 1 month):

  • Average latency: 380ms (vs 500ms SLO) ✅
  • Faithfulness: 0.91 (vs 0.85 target) ✅
  • User satisfaction: 4.6/5 ⭐
  • Support ticket resolution time: -40% 📉
  • Cost: $0.58 per 1K queries 💰

Summary

Document Search & Summarization combines:

  1. Hybrid Retrieval: BM25 + vector search for comprehensive recall
  2. Intelligent Reranking: Cross-encoders for precision
  3. Grounded Summarization: LLM-based with faithfulness verification
  4. Profile-Based Architecture: Different SLOs for different needs
  5. Production Patterns: Caching, monitoring, fail-closed design

Key Takeaways:

  • ✅ No single approach works for all use cases → profiles
  • ✅ Combine lexical and semantic search → hybrid
  • ✅ Always verify LLM outputs → grounding and faithfulness
  • ✅ Design for observability → metrics at every stage
  • ✅ Quality gates in CI/CD → automated evaluation

Next Steps:

  1. Try the Quick Start Guide
  2. Explore Example Implementations
  3. Read the API Reference
  4. Join the discussion in GitHub Issues

Further Reading: