
Document Search & Summarization: A Comprehensive Guide

A deep dive into hybrid retrieval, grounded summarization, and production-ready document intelligence


Overview

Document Search & Summarization is a sophisticated service that combines cutting-edge information retrieval techniques with grounded text summarization to provide accurate, cited, and trustworthy answers from document collections. This guide explores the theoretical foundations, architectural decisions, and practical implementation of a production-ready system.

What You'll Learn

  • Information Retrieval Theory: Understanding BM25, vector embeddings, and hybrid search
  • Summarization Techniques: Extractive vs. abstractive, grounding, and citation management
  • Profile-Based Architecture: Designing for different SLOs (latency, quality, cost)
  • Production Patterns: Fail-closed design, faithfulness verification, and evaluation
  • Implementation: Building and deploying a complete system

Table of Contents

  1. Theoretical Foundations
  2. Architecture & Design
  3. Information Retrieval Deep Dive
  4. Summarization Techniques
  5. Profile-Based Configuration
  6. Implementation Guide
  7. Evaluation & Quality Assurance
  8. Production Best Practices
  9. Advanced Topics

Theoretical Foundations

The Information Retrieval Problem

Core Challenge: Given a large collection of documents and a user query, how do we find the most relevant information quickly and accurately?

This is a classical problem in computer science with applications in search engines, question answering, and knowledge management. The challenge has multiple dimensions:

  1. Relevance: How well does a document answer the query?
  2. Recall: Are we finding all relevant documents?
  3. Precision: Are the results actually relevant?
  4. Latency: How quickly can we retrieve results?
  5. Scalability: Can the system handle millions of documents?

1. Keyword Search (1970s-2000s)

Approach: Match exact words between query and documents.

Example:

Query: "machine learning algorithms"
Matches: Documents containing "machine", "learning", AND "algorithms"

Limitations:

  • Misses synonyms ("ML" vs "machine learning")
  • Ignores semantics ("hot dog" vs "dog that is hot")
  • Vocabulary mismatch between query and document

2. TF-IDF (Term Frequency-Inverse Document Frequency)

Innovation: Weight terms by their importance.

Theory:

  • TF (Term Frequency): How often a term appears in a document

    TF(term, doc) = count(term in doc) / total_terms_in_doc
  • IDF (Inverse Document Frequency): How rare a term is across all documents

    IDF(term) = log(total_documents / documents_containing_term)
  • Combined Score:

    TF-IDF(term, doc) = TF(term, doc) × IDF(term)

Insight: Common words like "the" get low scores (high document frequency), while rare, informative words get high scores.
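
A minimal sketch of these definitions in Python (the toy corpus below is invented purely for illustration):

import math

def tf(term, doc):
    # Term frequency: occurrences of the term / total terms in the document
    words = doc.lower().split()
    return words.count(term) / len(words)

def idf(term, docs):
    # Inverse document frequency: log(total docs / docs containing the term)
    containing = sum(1 for d in docs if term in d.lower().split())
    return math.log(len(docs) / containing) if containing else 0.0

docs = [
    "the machine learning algorithms improve with the data",
    "the cat sat on the mat",
    "the basics of machine learning and deep learning",
]

for term in ("the", "learning"):
    score = tf(term, docs[0]) * idf(term, docs)
    print(term, round(score, 3))  # "the" scores 0.0 (appears everywhere); "learning" scores higher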

3. BM25 (Best Matching 25) - 1990s

Evolution: Improved TF-IDF with term saturation and document length normalization.

Formula:

Score(D,Q) = Σ IDF(qᵢ) × (f(qᵢ,D) × (k₁ + 1)) / (f(qᵢ,D) + k₁ × (1 - b + b × |D|/avgdl))

Where:
- D = document
- Q = query
- qᵢ = query term
- f(qᵢ,D) = term frequency in document
- |D| = document length
- avgdl = average document length
- k₁ = term frequency saturation parameter (typically 1.2)
- b = length normalization parameter (typically 0.75)

Key Improvements:

  1. Term Saturation: After a certain point, more occurrences don't increase relevance much
  2. Length Normalization: Adjust for document length (longer docs naturally have more term matches)
  3. Tunable Parameters: k₁ and b allow customization for different corpora
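
The formula translates almost directly into code. Below is a minimal single-document scorer using the default parameters above; it assumes IDF values are precomputed and is only a sketch, not a production implementation:

def bm25_score(query_terms, doc_terms, idf, avgdl, k1=1.2, b=0.75):
    """Score one document against a query with the BM25 formula above.

    idf: dict mapping term -> precomputed IDF value (assumed available).
    """
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        freq = doc_terms.count(term)  # f(qi, D)
        if freq == 0 or term not in idf:
            continue
        # Term saturation plus document length normalization
        numerator = freq * (k1 + 1)
        denominator = freq + k1 * (1 - b + b * doc_len / avgdl)
        score += idf[term] * numerator / denominator
    return score

# Toy usage with made-up IDF values
idf = {"machine": 1.2, "learning": 0.9, "algorithms": 1.5}
doc = "machine learning algorithms for machine translation".split()
print(bm25_score(["machine", "learning"], doc, idf, avgdl=10.0))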

Why BM25 Still Matters (2025): Despite being 30+ years old, BM25 remains highly effective for:

  • Exact keyword matching
  • Technical queries with specific terminology
  • Queries where word choice matters
  • Low-latency requirements (no neural network inference)

4. Vector Embeddings & Semantic Search (2010s-present)

Paradigm Shift: Represent text as dense vectors in continuous space where semantic similarity = geometric proximity.

How It Works:

  1. Text → Vector:

    "machine learning" → [0.23, -0.45, 0.67, ..., 0.12]  # 768-3072 dimensions
  2. Similarity = Cosine Similarity:

    similarity(v₁, v₂) = (v₁ · v₂) / (||v₁|| × ||v₂||)
  3. Semantic Understanding:

    "ML algorithms" ≈ "machine learning methods" ≈ "AI techniques"

    All map to similar regions in vector space!
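
A minimal sketch of the similarity computation with NumPy (the random vectors here stand in for real embeddings):

import numpy as np

def cosine_similarity(v1, v2):
    # (v1 · v2) / (||v1|| × ||v2||)
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# Stand-in vectors; in practice these come from an embedding model
v1 = np.random.rand(768)
v2 = np.random.rand(768)
print(cosine_similarity(v1, v2))  # values near 1.0 indicate semantic similarity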

Breakthrough Models:

  • Word2Vec (2013): Word-level embeddings
  • BERT (2018): Contextual embeddings
  • Sentence-BERT (2019): Optimized for semantic search
  • OpenAI Ada/text-embedding-3 (2024): 3072-dimensional embeddings

Advantages:

  • ✅ Handles synonyms and paraphrasing
  • ✅ Understands semantic similarity
  • ✅ Cross-lingual capabilities
  • ✅ Captures context

Limitations:

  • ❌ May miss exact keyword matches
  • ❌ Higher computational cost
  • ❌ "Black box" - harder to explain why results matched

Key Insight: BM25 and vector search are complementary, not competitive.

| Aspect           | BM25                            | Vector Search                     |
| ---------------- | ------------------------------- | --------------------------------- |
| Strength         | Exact keywords, technical terms | Semantic similarity, paraphrasing |
| Speed            | Very fast                       | Requires inference                |
| Explainability   | Clear (term matching)           | Opaque (neural network)           |
| Handles Synonyms | ❌ No                           | ✅ Yes                            |
| Exact Matches    | ✅ Excellent                    | ❌ May miss                       |
| Cost             | Free                            | Embedding costs                   |

Solution: Combine both approaches!


Architecture & Design

Profile-Based Architecture

Design Philosophy: Different use cases require different tradeoffs between latency, quality, and cost.

Instead of one-size-fits-all, we define profiles with specific SLOs (Service Level Objectives):

                User Need
                    ↓
             Profile Selection
            /       |       \
     Balanced    Latency    Quality
         ↓          ↓          ↓
      Different Configurations
         ↓          ↓          ↓
           Different SLOs

Three Profiles

Profile 1: Balanced (General-Purpose)

Use Case: Customer support, knowledge bases, general Q&A

SLOs:

  • Latency: < 500ms (P95)
  • Quality: Good (0.7-0.8 metrics)
  • Cost: $0.60 per 1,000 queries

Configuration:

retrieval:
  topK: 20
  alpha: 0.5                  # Equal weight BM25 and vector
  query_expansion: PRF        # Pseudo-Relevance Feedback

reranking:
  enabled: true
  model: cross-encoder/ms-marco-MiniLM-L-6-v2   # Light model

summarization:
  mode: extractive            # Fast, no LLM costs
  max_length: 250 words
  citations: inline

Why This Works:

  • Hybrid search catches both exact matches and semantic similarity
  • Light reranking improves top results without major latency hit
  • Extractive summarization is fast and always faithful to source

Profile 2: Latency-First (Interactive)

Use Case: Auto-complete, real-time chat, instant search

SLOs:

  • Latency: < 250ms (P95)
  • Quality: Acceptable (0.7 metrics)
  • Cost: $0.35 per 1,000 queries

Configuration:

retrieval:
  topK: 10                    # Fewer results for speed
  alpha: 0.7                  # Favor BM25 (faster than vector)
  query_expansion: disabled   # Skip for speed

reranking:
  enabled: false              # Skip for speed

summarization:
  mode: extractive
  max_length: 150 words       # Shorter
  citations: inline

Optimization Strategies:

  • Favor BM25 (no neural network inference)
  • Reduce topK (less processing)
  • Skip query expansion and reranking
  • Cache aggressively

Profile 3: Quality-First (Research)

Use Case: Compliance, legal research, scientific literature review

SLOs:

  • Latency: < 5000ms (P95)
  • Quality: Excellent (0.85-0.95 metrics)
  • Cost: $52.50 per 1,000 queries

Configuration:

retrieval:
  topK: 50                    # Cast wider net
  alpha: 0.5
  query_expansion: HyDE       # Hypothetical Document Embeddings

reranking:
  enabled: true
  model: cross-encoder/ms-marco-MiniLM-L-12-v2   # Larger model
  two_stage: true             # First cross-encoder, then LLM

summarization:
  mode: abstractive           # LLM-based
  model: gpt-4
  max_length: 500 words
  citations: inline
  faithfulness_threshold: 0.95   # Strict

Quality Enhancements:

  • HyDE query expansion finds conceptually similar docs
  • Two-stage reranking for maximum precision
  • GPT-4 abstractive summarization for fluency
  • Strict faithfulness requirements

System Architecture

┌───────────────────────────────────────────────────────────┐
│                    Client Application                      │
└─────────────────────────────┬─────────────────────────────┘
                              │
                              ↓
┌───────────────────────────────────────────────────────────┐
│                  Profile Selection Layer                    │
│         (balanced | latency_first | quality_first)         │
└─────────────────────────────┬─────────────────────────────┘
                              │
                              ↓
┌───────────────────────────────────────────────────────────┐
│                 Document Search Pipeline                    │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐  │
│  │  Retriever   │ →  │   Reranker   │ →  │  Summarizer  │  │
│  │   (Hybrid)   │    │  (Optional)  │    │  (Grounded)  │  │
│  └──────────────┘    └──────────────┘    └──────────────┘  │
└─────────────────────────────┬─────────────────────────────┘
                              │
                              ↓
┌───────────────────────────────────────────────────────────┐
│                 Storage & Indexing Layer                    │
│  ┌────────────┐    ┌────────────┐    ┌────────────┐        │
│  │ OpenSearch │    │  MongoDB   │    │   Redis    │        │
│  │ (BM25+kNN) │    │ (Metadata) │    │  (Cache)   │        │
│  └────────────┘    └────────────┘    └────────────┘        │
└───────────────────────────────────────────────────────────┘

Key Design Principles:

  1. Composability: Each component (retriever, reranker, summarizer) is independent
  2. Configurability: Profiles change configuration, not code
  3. Observability: Each stage reports timing and metrics
  4. Fail-Closed: If quality checks fail, fall back to safer option
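
One way to express the composability principle is with minimal stage interfaces that profiles can mix and match. The Protocol names below are illustrative, not the actual package API:

from typing import List, Protocol

class Retriever(Protocol):
    def retrieve(self, query: str, topK: int) -> List["SearchResult"]: ...

class Reranker(Protocol):
    def rerank(self, query: str, results: List["SearchResult"]) -> List["SearchResult"]: ...

class Summarizer(Protocol):
    def summarize(self, query: str, contexts: List["SearchResult"]) -> "GroundedSummary": ...

# A pipeline depends only on these interfaces, so a profile can swap or
# disable a stage (e.g. skip reranking) through configuration, not code.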

Information Retrieval Deep Dive

Hybrid Search Implementation

Core Algorithm: Reciprocal Rank Fusion (RRF)

Problem: How do we combine scores from BM25 and vector search?

Naive Approaches (Don't Work):

# ❌ Simple averaging - different score scales!
combined = (bm25_score + vector_score) / 2

# ❌ Normalization - loses relative differences
combined = normalize(bm25_score) + normalize(vector_score)

RRF Solution (Works!):

def reciprocal_rank_fusion(bm25_results, vector_results, k=60):
    """
    RRF: Combine rankings, not scores.

    Insight: Rank position is more meaningful than raw score.
    """
    scores = {}

    # BM25 contribution
    for rank, doc_id in enumerate(bm25_results, 1):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)

    # Vector contribution
    for rank, doc_id in enumerate(vector_results, 1):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)

    # Sort by combined score
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

Why This Works:

  1. Rank-based: Position matters more than score magnitude
  2. Robust: Works even when score distributions differ
  3. Tunable: Parameter k controls emphasis (typically 60)
  4. Additive: Documents appearing in both lists get higher scores

α-weighting Extension:

def weighted_rrf(bm25_results, vector_results, alpha=0.5, k=60):
    """
    Add configurable weighting between BM25 and vector.

    alpha = 0: Pure BM25
    alpha = 0.5: Equal weight
    alpha = 1: Pure vector
    """
    scores = {}

    for rank, doc_id in enumerate(bm25_results, 1):
        scores[doc_id] = scores.get(doc_id, 0) + (1 - alpha) / (k + rank)

    for rank, doc_id in enumerate(vector_results, 1):
        scores[doc_id] = scores.get(doc_id, 0) + alpha / (k + rank)

    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

Query Expansion Techniques

Motivation: User queries are often short and miss key terms.

Technique 1: Pseudo-Relevance Feedback (PRF)

Idea: Use top results to expand the query.

Algorithm:

1. Execute initial search with original query
2. Extract terms from top-N results (typically N=3-5)
3. Identify high-frequency terms not in original query
4. Add these terms to create expanded query
5. Re-search with expanded query

Example:

Original query: "password reset"

Top result terms: password, reset, account, security, email, verification, link

Expanded query: "password reset account security email verification"

When It Works:

  • Top results are highly relevant
  • Query is under-specified
  • Domain has consistent terminology

When It Fails:

  • Top results are off-topic (bad initial query)
  • Over-expansion (too many terms)
  • Query drift (meaning changes)
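
A compact sketch of PRF-style expansion (assuming a `search` callable that returns result texts ranked best-first; real systems typically weight candidate terms by TF-IDF rather than raw counts):

from collections import Counter

def expand_query_prf(query, search, top_n=3, num_terms=5):
    """Pseudo-Relevance Feedback sketch: expand the query with frequent
    terms from the top-N initial results. `search(query)` is assumed to
    return a list of result texts, best first."""
    initial_results = search(query)[:top_n]

    query_terms = set(query.lower().split())
    counts = Counter(
        word
        for text in initial_results
        for word in text.lower().split()
        if word not in query_terms and len(word) > 3  # skip original and very short terms
    )

    expansion = [term for term, _ in counts.most_common(num_terms)]
    return query + " " + " ".join(expansion)

# Hypothetical usage:
# expanded = expand_query_prf("password reset", search=my_search_fn)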

Technique 2: HyDE (Hypothetical Document Embeddings)

Idea: Generate what an ideal document would look like, then search for similar documents.

Algorithm:

1. Prompt LLM: "Write a passage that answers: {query}"
2. LLM generates hypothetical document
3. Embed hypothetical document
4. Search for documents similar to this embedding

Example:

Query: "How does OAuth2 work?"

HyDE Generation (LLM):
"OAuth2 is an authorization framework that enables applications to obtain
limited access to user accounts. It works by redirecting users to a service
provider where they authenticate and authorize access. The client receives
an access token that can be used to make API requests on behalf of the user.
Key components include the authorization server, resource server, client,
and resource owner. The flow involves authorization codes, access tokens,
and refresh tokens..."

Search: Documents similar to this detailed explanation

Why It Works:

  • Bridges vocabulary gap (query language → document language)
  • Captures intent implicitly
  • Works well for conceptual queries

Cost Consideration:

  • Requires LLM call per query (~$0.0001-0.001)
  • Adds 200-500ms latency
  • Use selectively (quality-first profile)
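
A sketch of HyDE using an OpenAI-style client; the model names are placeholders, and the same pattern applies to whichever LLM and embedding provider the deployment uses:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def hyde_embedding(query, llm_model="gpt-4o-mini", embed_model="text-embedding-3-large"):
    """HyDE sketch: generate a hypothetical answer, then embed it.
    The resulting vector replaces the raw query embedding for the
    vector-search leg of hybrid retrieval."""
    # Steps 1-2: generate a hypothetical document that answers the query
    response = client.chat.completions.create(
        model=llm_model,
        messages=[{"role": "user",
                   "content": f"Write a short passage that answers: {query}"}],
    )
    hypothetical_doc = response.choices[0].message.content

    # Step 3: embed the hypothetical document
    embedding = client.embeddings.create(model=embed_model, input=hypothetical_doc)
    return embedding.data[0].embedding  # step 4: use this vector for kNN search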

Reranking Deep Dive

Problem: Initial retrieval (BM25 + vector) optimizes for recall. We need precision.

Solution: Cross-encoder reranking

How It Works

Bi-Encoder (Retrieval):

Query    → Encoder → [embedding] ─┐
                                  ├─ Cosine Similarity
Document → Encoder → [embedding] ─┘

Fast but approximate (query and doc encoded separately)

Cross-Encoder (Reranking):

[Query + Document] → Transformer → Relevance Score

Accurate but slow (query and doc processed together)

Two-Stage Architecture:

Stage 1 (Retrieval):
Bi-encoder searches 10M documents → Top 20-50 candidates
(Fast: ~100-200ms)

Stage 2 (Reranking):
Cross-encoder scores 20-50 candidates → Top 5-10
(Slower: ~50-100ms)

Cross-Encoder Models

| Model                   | Parameters | Latency | Use Case       |
| ----------------------- | ---------- | ------- | -------------- |
| ms-marco-MiniLM-L-6-v2  | 22M        | ~50ms   | Balanced       |
| ms-marco-MiniLM-L-12-v2 | 33M        | ~100ms  | Quality-first  |
| ms-marco-electra-base   | 110M       | ~200ms  | Research-grade |

Practical Tip: Reranking improves nDCG (ranking quality) by 15-30% but adds latency. Profile-based approach lets you choose the tradeoff.
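
For reference, the reranking stage can be sketched with the sentence-transformers CrossEncoder (assuming each candidate object exposes a `.content` attribute):

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidates, top_n=10):
    """Stage 2: score each (query, document) pair jointly and keep the best."""
    pairs = [(query, c.content) for c in candidates]
    scores = reranker.predict(pairs)  # one relevance score per pair
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [candidate for candidate, _ in ranked[:top_n]]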


Summarization Techniques

Extractive vs. Abstractive

Fundamental Distinction: How is the summary created?

Extractive Summarization

Definition: Select important sentences from source documents.

Process:

Source: [S1, S2, S3, S4, S5]
    ↓ Score each sentence
Scores: [0.9, 0.3, 0.8, 0.5, 0.7]
    ↓ Select top N
Summary: [S1, S3, S5]

Advantages:

  • ✅ Always faithful (sentences are from source)
  • ✅ Fast (no generation)
  • ✅ Free (no LLM API costs)
  • ✅ Easy to cite (sentence → source mapping)

Disadvantages:

  • ❌ Can be choppy (disconnected sentences)
  • ❌ Limited fluency
  • ❌ Redundancy not removed
  • ❌ Can't synthesize across documents

Abstractive Summarization

Definition: Generate new text that captures key information.

Process:

Source: [Documents]
    ↓ LLM reads and understands
Comprehension: [Key concepts, relationships]
    ↓ LLM generates new text
Summary: Fluent, synthetic text

Advantages:

  • ✅ Natural, fluent language
  • ✅ Can synthesize across documents
  • ✅ Removes redundancy
  • ✅ Customizable style and length

Disadvantages:

  • ❌ Hallucination risk
  • ❌ Expensive (LLM API costs)
  • ❌ Slower (generation time)
  • ❌ Citation tracking harder

Extractive: TextRank Algorithm

Core Idea: Text is a graph. Important sentences are well-connected.

Algorithm:

1. Build Graph:
   - Nodes = sentences
   - Edges = similarity between sentences

2. Calculate PageRank:
   - Iteratively propagate importance
   - Well-connected nodes get high scores

3. Select Top-K:
   - Sort by score
   - Return top N sentences

Mathematical Foundation:

PageRank Formula:

PR(Sᵢ) = (1 - d) + d × Σ_{Sⱼ→Sᵢ} PR(Sⱼ) / |Out(Sⱼ)|

Where:
- PR(Sᵢ) = PageRank score of sentence i
- d = damping factor (typically 0.85)
- Sⱼ→Sᵢ = sentences Sⱼ that link to sentence Sᵢ
- |Out(Sⱼ)| = number of outgoing edges from sentence j
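
A minimal power-iteration sketch of this scoring, using a pairwise similarity function (such as the word-overlap version below) to supply edge weights:

def textrank_scores(sentences, similarity, d=0.85, iterations=30):
    """Power iteration over the sentence graph; `similarity(s1, s2)` supplies
    edge weights (for example, the word-overlap function shown below)."""
    n = len(sentences)
    weights = [[similarity(s1, s2) if i != j else 0.0
                for j, s2 in enumerate(sentences)]
               for i, s1 in enumerate(sentences)]
    out_sums = [sum(row) or 1.0 for row in weights]  # guard against isolated sentences

    scores = [1.0] * n
    for _ in range(iterations):
        scores = [(1 - d) + d * sum(weights[j][i] * scores[j] / out_sums[j]
                                    for j in range(n))
                  for i in range(n)]
    return scores  # higher score = more central sentence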

Similarity Metric (determines edges):

def sentence_similarity(s1, s2):
    """Jaccard similarity based on word overlap."""
    words1 = set(s1.lower().split())
    words2 = set(s2.lower().split())

    intersection = words1 & words2
    union = words1 | words2

    return len(intersection) / len(union) if union else 0

Why It Works:

  • Captures centrality (sentences that connect to many others)
  • Democratic (all sentences contribute to scoring)
  • Unsupervised (no training needed)

Implementation:

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words

# Initialize
stemmer = Stemmer("english")
summarizer = TextRankSummarizer(stemmer)
summarizer.stop_words = get_stop_words("english")

# Parse document
parser = PlaintextParser.from_string(text, Tokenizer("english"))

# Generate summary (5 sentences)
summary = summarizer(parser.document, 5)

Abstractive: LLM-Based with Grounding

Challenge: LLMs hallucinate. How do we ensure faithfulness?

Solution: Grounded Generation

Grounded Summarization Process

1. Context Preparation:
   [Source 1]: Document content...
   [Source 2]: Document content...
   [Source 3]: Document content...

2. Prompt Engineering:
   "Based ONLY on the sources above, answer: {query}
    - Cite sources using [1], [2], etc.
    - If information not in sources, say so
    - Do not add external knowledge"

3. LLM Generation:
   Summary with inline citations

4. Faithfulness Verification:
   Check that claims are supported by sources

5. Fail-Closed:
   If faithfulness < threshold, use extractive fallback
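
Putting the five steps together, a fail-closed flow might look like the sketch below; `llm`, `verify_faithfulness` (next subsection), and `extractive_fallback` are placeholders for the components described in this guide:

def grounded_summarize(query, sources, llm, threshold=0.85):
    """Grounded generation with a fail-closed fallback (sketch)."""
    # Steps 1-2: sources-only context and prompt
    context = "\n\n".join(f"[Source {i + 1}]: {s}" for i, s in enumerate(sources))
    prompt = (
        f"{context}\n\n"
        f"Based ONLY on the sources above, answer: {query}\n"
        "- Cite sources using [1], [2], etc.\n"
        "- If the information is not in the sources, say so.\n"
        "- Do not add external knowledge."
    )

    # Step 3: generate (llm is any callable that takes a prompt and returns text)
    summary = llm(prompt)

    # Step 4: verify against the sources
    faithfulness = verify_faithfulness(summary, sources)

    # Step 5: fail closed if the summary is not sufficiently grounded
    if faithfulness < threshold:
        return extractive_fallback(query, sources), faithfulness
    return summary, faithfulness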

Faithfulness Verification

Simple Approach (Word Overlap):

def verify_faithfulness(summary, sources):
    """
    Check if summary claims are grounded in sources.
    """
    summary_sentences = split_sentences(summary)
    context_words = set(word for source in sources for word in source.lower().split())

    faithful_count = 0
    for sentence in summary_sentences:
        sentence_words = set(sentence.lower().split())
        overlap = len(sentence_words & context_words)

        if overlap / len(sentence_words) >= 0.5:  # 50% threshold
            faithful_count += 1

    return faithful_count / len(summary_sentences)

Advanced Approach (NLI Model):

from transformers import pipeline

nli_model = pipeline("text-classification",
                     model="microsoft/deberta-large-mnli")

def verify_with_nli(summary_claim, source_text):
    """
    Use Natural Language Inference to verify claim.

    Returns: "entailment" | "neutral" | "contradiction"
    """
    result = nli_model(f"{source_text} [SEP] {summary_claim}")
    return result[0]["label"].lower()  # normalize label casing across model versions

# Usage
for claim in summary_claims:
    for source in sources:
        if verify_with_nli(claim, source) == "entailment":
            # Claim supported by this source
            break
    else:
        # Claim not supported by any source - potential hallucination
        flag_claim(claim)

Citation Management

Goal: Track provenance of every claim.

Data Structure:

@dataclass
class Citation:
    citation_id: int
    document_id: str
    document_title: str
    chunk_id: str
    snippet: str  # 100-150 chars showing where claim came from

@dataclass
class GroundedSummary:
    text: str  # "Revenue increased 25% [1]. Market share grew [2]."
    citations: Dict[int, Citation]
    faithfulness: float
    coverage: float

Citation Styles:

  1. Inline (Default):

    The company achieved record revenue in Q3 [1]. This was driven by 
    strong performance in the EMEA region [2] and new product launches [3].
  2. Footnote:

    The company achieved record revenue in Q3¹. This was driven by strong 
    performance in the EMEA region² and new product launches³.

    ---
    1. Q3 Financial Report, Page 5
    2. Regional Analysis Summary, Page 12
    3. Product Launch Timeline, Page 8
  3. Harvard:

    The company achieved record revenue in Q3 (Financial Report 2024). 
    This was driven by strong performance in the EMEA region (Regional
    Analysis 2024) and new product launches (Product Timeline 2024).

Implementation:

def add_citations(summary_sentences, sources):
    """Add citation markers to summary."""
    cited_summary = []
    citations = {}
    citation_id = 1

    for sentence in summary_sentences:
        # Find which source this sentence came from
        source_idx = find_source(sentence, sources)

        if source_idx is not None:
            # Add citation
            cited_summary.append(f"{sentence} [{citation_id}]")
            citations[citation_id] = Citation(
                citation_id=citation_id,
                document_id=sources[source_idx].doc_id,
                document_title=sources[source_idx].title,
                chunk_id=sources[source_idx].chunk_id,
                snippet=sentence[:150]
            )
            citation_id += 1
        else:
            # No source found - potential issue
            log_warning(f"Uncited sentence: {sentence}")
            cited_summary.append(sentence)

    return " ".join(cited_summary), citations

Profile-Based Configuration

Configuration as Code

Design Pattern: Profiles are first-class objects, not just parameters.

@dataclass
class PipelineConfig:
    """Complete pipeline configuration."""
    profile: ProfileType
    topK: int
    retrieval_alpha: float
    enable_reranking: bool
    enable_query_expansion: bool
    query_expansion_method: Optional[str]
    summarization_mode: SummarizationMode
    max_summary_length: int
    latency_budget_ms: int

    # SLO thresholds
    context_precision_threshold: float
    context_recall_threshold: float
    faithfulness_threshold: float
    response_relevancy_threshold: float

Factory Pattern

Usage:

# Simple profile creation
pipeline = DocumentSearchPipeline.create_profile(
    profile=ProfileType.BALANCED,
    store=opensearch_store,
    embedding_function=get_embeddings
)

# Execute with profile-specific behavior
result = pipeline.execute("How do I reset my password?")

Implementation:

@classmethod
def create_profile(cls, profile: ProfileType, store, embedding_function):
    """Factory method for profile-based pipelines."""

    if profile == ProfileType.BALANCED:
        return cls._create_balanced_profile(store, embedding_function)
    elif profile == ProfileType.LATENCY_FIRST:
        return cls._create_latency_first_profile(store, embedding_function)
    elif profile == ProfileType.QUALITY_FIRST:
        return cls._create_quality_first_profile(store, embedding_function)

@classmethod
def _create_balanced_profile(cls, store, embedding_function):
    """Configure balanced profile."""
    retriever = HybridDocumentRetriever(
        store=store,
        embedding_function=embedding_function,
        alpha=0.5,
        query_expander=QueryExpander(method="prf")
    )

    reranker = CrossEncoderReranker(
        model="cross-encoder/ms-marco-MiniLM-L-6-v2"
    )

    summarizer = GroundedSummarizer(
        mode=SummarizationMode.EXTRACTIVE,
        faithfulness_threshold=0.85
    )

    config = PipelineConfig(
        profile=ProfileType.BALANCED,
        topK=20,
        retrieval_alpha=0.5,
        enable_reranking=True,
        enable_query_expansion=True,
        query_expansion_method="prf",
        summarization_mode=SummarizationMode.EXTRACTIVE,
        max_summary_length=250,
        latency_budget_ms=500,
        context_precision_threshold=0.70,
        context_recall_threshold=0.70,
        faithfulness_threshold=0.85,
        response_relevancy_threshold=0.75
    )

    return cls(retriever, summarizer, config, reranker)

SLO Enforcement

Monitoring:

def execute(self, query, filters=None):
    """Execute with SLO tracking."""
    start_time = time.time()
    timing = {}

    # Retrieval
    retrieval_start = time.time()
    results = self.retriever.retrieve(query, topK=self.config.topK)
    timing["retrieval_ms"] = (time.time() - retrieval_start) * 1000

    # Check if we're approaching budget
    if timing["retrieval_ms"] > self.config.latency_budget_ms * 0.5:
        log_warning("Retrieval taking too long, may miss SLO")

    # Reranking (optional)
    if self.config.enable_reranking:
        reranking_start = time.time()
        results = self.reranker.rerank(query, results)
        timing["reranking_ms"] = (time.time() - reranking_start) * 1000

    # Summarization
    summarization_start = time.time()
    summary = self.summarizer.summarize(query, results[:5])
    timing["summarization_ms"] = (time.time() - summarization_start) * 1000

    # Total time
    timing["total_ms"] = (time.time() - start_time) * 1000

    # SLO check
    slo_met = timing["total_ms"] <= self.config.latency_budget_ms

    if not slo_met:
        log_warning(f"SLO missed: {timing['total_ms']:.2f}ms > {self.config.latency_budget_ms}ms")
        emit_metric("slo_violation", {"profile": self.config.profile.value})

    return PipelineResult(
        query=query,
        results=results,
        summary=summary,
        timing=timing,
        slo_met=slo_met,
        profile=self.config.profile.value
    )

Implementation Guide

Quick Start

See Complete Working Example:

Basic Usage (3 steps):

from packages.rag.document_search import (
    DocumentSearchPipeline, ProfileType, OpenSearchDocumentStore
)

# 1. Initialize with balanced profile
store = OpenSearchDocumentStore("localhost", 9200, "documents")
pipeline = DocumentSearchPipeline.create_profile(
    ProfileType.BALANCED, store, embedding_function
)

# 2. Execute search
result = pipeline.execute("What were the key findings?")

# 3. Use results
print(result.summary.text)  # Grounded summary with citations

Key Implementation Files:

  • packages/rag/document_search/pipeline.py - Main pipeline with profiles
  • packages/rag/document_search/store.py - OpenSearch integration
  • packages/rag/document_search/retriever.py - Hybrid retrieval
  • packages/rag/document_search/summarizer.py - Grounded summarization

For detailed code examples, see the implementation files and demo script.


Evaluation & Quality Assurance

RAGAS Metrics

RAGAS (Retrieval Augmented Generation Assessment) provides standardized metrics for RAG systems.

Metric 1: Context Precision

Question: Are the retrieved documents relevant to the query?

Formula:

Context Precision = (Relevant Retrieved) / (Total Retrieved)

Example:

Query: "How to reset password?"
Retrieved: 10 documents
Relevant: 7 documents (password reset related)

Context Precision = 7/10 = 0.70

Target: > 0.70 (balanced), > 0.85 (quality-first)

Metric 2: Context Recall

Question: Did we retrieve all relevant documents?

Formula:

Context Recall = (Relevant Retrieved) / (Total Relevant in Corpus)

Example:

Query: "How to reset password?"
Total relevant in corpus: 10 documents
Retrieved: 7 documents

Context Recall = 7/10 = 0.70

Target: > 0.70 (balanced), > 0.85 (quality-first)
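
Both retrieval metrics reduce to simple set arithmetic once relevance labels exist; a toy sketch with hand-labeled document IDs:

def context_precision(retrieved_ids, relevant_ids):
    # Relevant retrieved / total retrieved
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in retrieved_ids if doc_id in relevant)
    return hits / len(retrieved_ids) if retrieved_ids else 0.0

def context_recall(retrieved_ids, relevant_ids):
    # Relevant retrieved / total relevant in corpus
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in retrieved_ids if doc_id in relevant)
    return hits / len(relevant) if relevant else 0.0

retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d7"]
print(context_precision(retrieved, relevant))  # 2/4 = 0.50
print(context_recall(retrieved, relevant))     # 2/3 ≈ 0.67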

Metric 3: Faithfulness

Question: Is the answer grounded in the retrieved context?

Formula:

Faithfulness = (Claims Supported by Context) / (Total Claims in Answer)

Example:

Answer: "Password can be reset via email [✓] or phone [✗]"
Context mentions: email reset
Context does NOT mention: phone reset

Faithfulness = 1/2 = 0.50 (FAIL)

Target: > 0.85 (balanced), > 0.95 (quality-first)

Metric 4: Answer Relevancy

Question: Does the answer actually address the query?

Formula (using cosine similarity):

Answer Relevancy = cosine_similarity(embed(query), embed(answer))

Target: > 0.75 (balanced), > 0.80 (quality-first)

Evaluation Pipeline

from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_relevancy
)

def evaluate_pipeline(pipeline, test_cases):
    """Evaluate pipeline on test dataset."""

    results = []
    for test in test_cases:
        # Execute pipeline
        result = pipeline.execute(test.query, filters=test.filters)

        # Prepare data for RAGAS
        results.append({
            "question": test.query,
            "contexts": [r.content for r in result.results],
            "answer": result.summary.text,
            "ground_truth": test.ground_truth
        })

    # Calculate metrics
    evaluation = evaluate(
        dataset=results,
        metrics=[
            context_precision,
            context_recall,
            faithfulness,
            answer_relevancy
        ]
    )

    return evaluation

CI/CD Gates

Automated Quality Gates:

# .github/workflows/evaluation.yml
name: Quality Gates

on: [pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - name: Run evaluation
        run: python scripts/run_evaluation.py

      - name: Check metrics
        run: |
          # Fail if metrics fall below thresholds (jq evaluates the float comparison)
          if [ "$(jq '.context_precision < 0.70' metrics.json)" = "true" ]; then
            echo "Context precision below threshold!"
            exit 1
          fi

          if [ "$(jq '.faithfulness < 0.85' metrics.json)" = "true" ]; then
            echo "Faithfulness below threshold!"
            exit 1
          fi

      - name: Check latency
        run: |
          # Fail if P95 latency exceeds budget
          if [ "$(jq '.latency_p95 > 500' metrics.json)" = "true" ]; then
            echo "Latency SLO violated!"
            exit 1
          fi

Production Best Practices

1. Caching Strategy

Multi-Level Caching:

class CachedPipeline:
    """Pipeline with aggressive caching."""

    def __init__(self, pipeline, redis_client):
        self.pipeline = pipeline
        self.redis = redis_client

    def execute(self, query, filters=None):
        # L1: Query cache (full result)
        cache_key_1 = f"query:{hash(query)}:{hash(str(filters))}"
        cached = self.redis.get(cache_key_1)
        if cached:
            return json.loads(cached)

        # L2: Retrieval cache
        cache_key_2 = f"retrieval:{hash(query)}"
        cached_results = self.redis.get(cache_key_2)

        if cached_results:
            results = json.loads(cached_results)
        else:
            results = self.pipeline.retriever.retrieve(query, filters)
            self.redis.setex(cache_key_2, 3600, json.dumps(results))  # 1hr TTL

        # L3: Summary cache
        cache_key_3 = f"summary:{hash(query)}:{hash(str([r.chunk_id for r in results[:5]]))}"
        cached_summary = self.redis.get(cache_key_3)

        if cached_summary:
            summary = json.loads(cached_summary)
        else:
            summary = self.pipeline.summarizer.summarize(query, results[:5])
            self.redis.setex(cache_key_3, 86400, json.dumps(summary))  # 24hr TTL

        # Construct result
        result = PipelineResult(...)

        # Cache full result
        self.redis.setex(cache_key_1, 3600, json.dumps(result))

        return result

Cache Invalidation:

def invalidate_document_cache(document_id):
    """Invalidate all caches related to a document."""
    # When document is updated/deleted
    pattern = f"*:{document_id}:*"
    keys = redis.keys(pattern)
    if keys:
        redis.delete(*keys)

2. Error Handling

Graceful Degradation:

def execute_with_fallback(self, query, filters=None):
    """Execute with multiple fallback levels."""

    try:
        # Try quality-first profile
        return self.quality_pipeline.execute(query, filters)

    except TimeoutError:
        log_warning("Quality pipeline timeout, falling back to balanced")
        try:
            # Fall back to balanced profile
            return self.balanced_pipeline.execute(query, filters)

        except Exception as e:
            log_error(f"Balanced pipeline failed: {e}")
            try:
                # Fall back to latency-first (minimal features)
                return self.latency_pipeline.execute(query, filters)

            except Exception as e:
                log_error(f"All pipelines failed: {e}")
                # Return error response
                return ErrorResponse(
                    error="Service temporarily unavailable",
                    retry_after=60
                )

3. Monitoring

Key Metrics to Track:

# Latency
emit_metric("search.latency_ms", timing["total_ms"], {
    "profile": config.profile,
    "slo_met": slo_met
})

# Quality
emit_metric("search.faithfulness", summary.faithfulness, {
    "profile": config.profile,
    "mode": summary.mode
})

# Usage
emit_metric("search.queries_total", 1, {
    "profile": config.profile,
    "cache_hit": cache_hit
})

# Cost
emit_metric("search.cost_usd", cost, {
    "profile": config.profile,
    "llm_calls": llm_calls
})

Dashboards:

┌──────────────────────────────────────────────────────┐
│              Document Search Dashboard                │
├──────────────────────────────────────────────────────┤
│ Latency (P95)                                         │
│   Balanced:        482ms  [==============]  96.4%     │
│   Latency-First:   198ms  [=========]       99.1%     │
│   Quality-First:  4521ms  [==============]  90.4%     │
│                                                       │
│ Quality Metrics (Balanced)                            │
│   Context Precision:  0.73  [====]   ✓                │
│   Faithfulness:       0.89  [=====]  ✓                │
│   Relevancy:          0.78  [====]   ✓                │
│                                                       │
│ Cost (Last 24h)                                       │
│   Balanced:       $12.50  (20.8K queries)             │
│   Latency-First:   $7.20  (20.6K queries)             │
│   Quality-First: $1,050   (20 queries)                │
└──────────────────────────────────────────────────────┘

4. A/B Testing

Comparative Evaluation:

import numpy as np
import scipy.stats as stats

def ab_test_profiles(test_queries, pipeline_a, pipeline_b):
    """Compare two profiles on the same queries."""

    results_a = []
    results_b = []

    for query in test_queries:
        # Profile A
        result_a = pipeline_a.execute(query)
        results_a.append({
            "latency_ms": result_a.timing["total_ms"],
            "faithfulness": result_a.summary.faithfulness,
            "num_results": len(result_a.results)
        })

        # Profile B
        result_b = pipeline_b.execute(query)
        results_b.append({
            "latency_ms": result_b.timing["total_ms"],
            "faithfulness": result_b.summary.faithfulness,
            "num_results": len(result_b.results)
        })

    # Statistical comparison
    latency_p_value = stats.ttest_ind(
        [r["latency_ms"] for r in results_a],
        [r["latency_ms"] for r in results_b]
    ).pvalue

    print(f"Latency difference significant: {latency_p_value < 0.05}")
    print(f"Profile A avg latency: {np.mean([r['latency_ms'] for r in results_a]):.2f}ms")
    print(f"Profile B avg latency: {np.mean([r['latency_ms'] for r in results_b]):.2f}ms")

Advanced Topics

1. Multi-Document Summarization

Challenge: Synthesize information across multiple documents.

Approach:

class MultiDocumentSummarizer:
    """Summarize multiple documents together."""

    def summarize_multiple(self, query, document_ids):
        """
        Generate comparative/synthesis summary.

        Techniques:
        1. Cluster similar information
        2. Identify common themes
        3. Highlight differences
        4. Synthesize with LLM
        """

        # Retrieve all documents
        all_chunks = []
        for doc_id in document_ids:
            chunks = self.retrieve_document_chunks(doc_id)
            all_chunks.extend(chunks)

        # Cluster by theme
        themes = self.cluster_by_theme(all_chunks)

        # Build comparative prompt
        prompt = f"""Compare and synthesize the following documents regarding: {query}

Documents:
{self.format_documents_with_sources(all_chunks)}

Provide:
1. Common themes across documents
2. Key differences or conflicts
3. Synthesized conclusion

Cite sources using [Doc1], [Doc2], etc.
"""

        # Generate synthesis
        response = self.llm.invoke(prompt)

        return MultiDocSummary(
            text=response.content,
            num_documents=len(document_ids),
            themes=themes,
            citations=self.extract_citations(response.content)
        )

2. Streaming Summarization

Use Case: Real-time summary generation for long documents.

def stream_summary(self, query, contexts):
    """Stream summary tokens as they're generated."""

    prompt = self._build_prompt(query, contexts)

    for chunk in self.llm.stream(prompt):
        # Yield tokens in real-time
        yield chunk.content

        # Optionally: Run citation extraction incrementally
        if chunk.content.endswith(('.', '!', '?')):
            sentence = self.buffer + chunk.content
            citations = self.extract_citations(sentence)
            yield {"citations": citations}
            self.buffer = ""
        else:
            self.buffer += chunk.content

3. Hierarchical Summarization

Use Case: Very long documents (> 100K words).

Strategy: Pyramid summarization.

Level 0 (Original): [C1, C2, C3, ..., C100]  (100 chunks)
        ↓ Summarize in groups of 10
Level 1: [S1, S2, S3, ..., S10]  (10 summaries)
        ↓ Summarize again
Level 2: [S_final]  (1 summary)

Implementation:

def hierarchical_summary(self, document, chunk_size=500, group_size=10):
    """Multi-level summarization for very long documents."""

    # Level 0: Chunk document
    chunks = self.chunker.chunk(document, chunk_size)

    current_level = chunks
    level = 0

    while len(current_level) > 1:
        level += 1
        next_level = []

        # Group and summarize
        for i in range(0, len(current_level), group_size):
            group = current_level[i:i + group_size]
            group_text = "\n\n".join([c.content for c in group])

            summary = self.summarizer.summarize(
                query="Summarize the key points",
                contexts=group,
                max_length=chunk_size
            )

            next_level.append(summary)

        current_level = next_level

    return current_level[0]  # Final summary

4. Domain-Specific Customization

Legal Documents:

legal_config = PipelineConfig(
    profile=ProfileType.QUALITY_FIRST,
    summarization_mode=SummarizationMode.EXTRACTIVE,  # Prefer exact text
    faithfulness_threshold=0.98,                      # Very strict
    max_summary_length=1000,                          # Longer, more detailed
    citation_style="bluebook"                         # Legal citation format
)

Medical Literature:

medical_config = PipelineConfig(
    profile=ProfileType.QUALITY_FIRST,
    query_expansion_method="mesh",  # Use MeSH terms
    summarization_prompt_template="""
    Based on the medical literature provided, summarize:
    - Study design and methodology
    - Key findings and p-values
    - Clinical implications
    - Limitations

    Cite using: [Author Year, Journal]
    """,
    domain_specific_validators=[
        validate_medical_terminology,
        check_dosage_accuracy,
        verify_clinical_recommendations
    ]
)

Real-World Example: Knowledge Base Assistant

Scenario: Customer support team needs to quickly find and summarize KB articles.

Requirements:

  • < 500ms response time
  • High accuracy (80%+ relevance)
  • Clear citations for verification
  • Handle 1000+ concurrent users

Implementation:

# 1. Initialize system
store = OpenSearchDocumentStore(
    host="opensearch.prod.company.com",
    index_name="kb_articles",
    embedding_dim=3072
)

pipeline = DocumentSearchPipeline.create_profile(
    profile=ProfileType.BALANCED,
    store=store,
    embedding_function=openai_embeddings
)

# 2. Index KB articles
for article in kb_articles:
    doc = {
        "doc_id": article.id,
        "content": article.content,
        "metadata": {
            "title": article.title,
            "category": article.category,
            "last_updated": article.updated_at,
            "author": article.author
        }
    }

    chunks = chunker.chunk(doc["content"], doc["metadata"])
    store.index_document(doc["doc_id"], chunks)

# 3. Handle support query
@app.post("/support/answer")
async def answer_support_query(request: SupportQuery):
    """Answer support query with KB articles."""

    # Execute search with profile
    result = pipeline.execute(
        query=request.question,
        filters={"category": request.category} if request.category else None
    )

    # Check SLO
    if not result.slo_met:
        log_warning(f"SLO missed for query: {request.question}")

    # Format response
    return {
        "answer": result.summary.text,
        "confidence": result.summary.faithfulness,
        "sources": [
            {
                "title": citation.document_title,
                "url": f"/kb/{citation.document_id}",
                "snippet": citation.snippet
            }
            for citation in result.summary.citations.values()
        ],
        "latency_ms": result.timing["total_ms"]
    }

# 4. Monitor and optimize
@app.get("/support/metrics")
async def get_metrics():
    """Get system metrics."""
    return {
        "queries_24h": get_query_count_24h(),
        "avg_latency_ms": get_avg_latency(),
        "p95_latency_ms": get_p95_latency(),
        "avg_faithfulness": get_avg_faithfulness(),
        "slo_compliance_rate": get_slo_compliance_rate(),
        "cache_hit_rate": get_cache_hit_rate()
    }

Results (After 1 month):

  • Average latency: 380ms (vs 500ms SLO) ✅
  • Faithfulness: 0.91 (vs 0.85 target) ✅
  • User satisfaction: 4.6/5 ⭐
  • Support ticket resolution time: -40% 📉
  • Cost: $0.58 per 1K queries 💰

Summary

Document Search & Summarization combines:

  1. Hybrid Retrieval: BM25 + vector search for comprehensive recall
  2. Intelligent Reranking: Cross-encoders for precision
  3. Grounded Summarization: LLM-based with faithfulness verification
  4. Profile-Based Architecture: Different SLOs for different needs
  5. Production Patterns: Caching, monitoring, fail-closed design

Key Takeaways:

  • ✅ No single approach works for all use cases → profiles
  • ✅ Combine lexical and semantic search → hybrid
  • ✅ Always verify LLM outputs → grounding and faithfulness
  • ✅ Design for observability → metrics at every stage
  • ✅ Quality gates in CI/CD → automated evaluation

Next Steps:

  1. Try the Quick Start Guide
  2. Explore Example Implementations
  3. Read the API Reference
  4. Join the discussion in GitHub Issues

Further Reading: