
Document Search & Summarization - Quick Reference

One-page cheat sheet for quick lookup


Three Profiles at a Glance

Profile         Latency   Quality     Cost       When to Use
Balanced        500ms     0.7-0.8     $0.60/1K   General Q&A, customer support
Latency-First   250ms     0.7         $0.35/1K   Interactive search, auto-complete
Quality-First   5s        0.85-0.95   $52/1K     Research, compliance, legal

Quick Start (3 Steps)

# 1. Setup
from packages.rag.document_search import *

store = OpenSearchDocumentStore("localhost", 9200, "documents")
pipeline = DocumentSearchPipeline.create_profile(
    ProfileType.BALANCED, store, embedding_function
)

# 2. Execute
result = pipeline.execute("Your query here")

# 3. Use Results
print(result.summary.text)      # Summary
print(result.summary.citations) # Citations
print(result.timing)            # Performance metrics

Architecture

Query → Retriever → Reranker → Summarizer → Result
            ↓           ↓           ↓
         (Hybrid)   (Optional)  (Grounded)
         BM25+Vec    CrossEnc   Extract/Abstract

Key Concepts

Hybrid Search = BM25 + Vector

  • BM25: Exact keyword matching (fast, explainable)
  • Vector: Semantic similarity (handles synonyms)
  • RRF: Combines both with reciprocal rank fusion
  • α: Weight parameter (0=BM25 only, 1=Vector only, 0.5=equal)
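The fusion step can be sketched in a few lines. `rrf_fuse` and its signature are illustrative assumptions, not the pipeline's actual API; it combines two ranked lists of document IDs using the α convention above:

```python
def rrf_fuse(bm25_ranking, vector_ranking, k=60, alpha=0.5):
    """Reciprocal rank fusion over two ranked lists of doc IDs.

    alpha weights the vector list (0 = BM25 only, 1 = vector only).
    k is the usual RRF rank constant (typically 60).
    """
    scores = {}
    for rank, doc_id in enumerate(bm25_ranking, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + (1 - alpha) / (k + rank)
    for rank, doc_id in enumerate(vector_ranking, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + alpha / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked well by both retrievers outranks one ranked first by only one of them, which is the point of RRF.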

Query Expansion

  • PRF (Pseudo-Relevance Feedback): Use top results to expand query
  • HyDE (Hypothetical Doc Embeddings): Generate ideal doc, search for similar
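A minimal PRF sketch, assuming whitespace tokenization and plain-string documents (the real expansion step is richer than this):

```python
from collections import Counter

def prf_expand(query, top_docs, num_terms=3):
    """Pseudo-relevance feedback: append the most frequent terms from
    the top-ranked documents that are not already in the query."""
    query_terms = set(query.lower().split())
    counts = Counter(
        term
        for doc in top_docs
        for term in doc.lower().split()
        if term not in query_terms
    )
    expansion = [term for term, _ in counts.most_common(num_terms)]
    return query + " " + " ".join(expansion)
```

HyDE differs in kind: instead of reusing retrieved terms, an LLM drafts a hypothetical answer document and its embedding is used as the search vector.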

Reranking

  • Bi-Encoder (retrieval): Fast, approximate similarity
  • Cross-Encoder (reranking): Slow, accurate relevance scoring
  • Two-Stage: Bi-encoder gets 50 candidates → Cross-encoder ranks → Top 5
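The two-stage pattern, sketched with the retriever and scorer passed in as callables (both names and signatures are assumptions for illustration):

```python
def two_stage_search(query, bi_encoder_search, cross_encoder_score,
                     candidates=50, final_k=5):
    """Stage 1: cheap bi-encoder retrieval pulls a wide candidate set.
    Stage 2: the expensive cross-encoder re-scores only those candidates."""
    docs = bi_encoder_search(query, top_k=candidates)
    scored = [(cross_encoder_score(query, doc), doc) for doc in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:final_k]]
```

The cross-encoder sees only `candidates` pairs per query, which is why the slow model stays affordable.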

Summarization

  • Extractive: Select important sentences (fast, faithful, free)
  • Abstractive: Generate new text (fluent, expensive, risk of hallucination)
  • Grounded: Verify claims against sources, cite everything
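A deliberately crude grounding check, to make the idea concrete (real faithfulness scoring uses an NLI model or LLM judge, not word overlap):

```python
def naive_faithfulness(claims, sources):
    """Toy grounding check: a claim counts as supported when every one
    of its words appears in at least one source passage."""
    def supported(claim):
        words = set(claim.lower().split())
        return any(words <= set(src.lower().split()) for src in sources)
    return sum(supported(c) for c in claims) / len(claims)
```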

Configuration Patterns

Pattern 1: Customer Support KB

config = PipelineConfig(
    profile=ProfileType.BALANCED,
    topK=20,
    alpha=0.5,
    enable_reranking=True,
    summarization_mode=SummarizationMode.EXTRACTIVE,
    max_summary_length=250
)

Pattern 2: Legal/Compliance

config = PipelineConfig(
    profile=ProfileType.QUALITY_FIRST,
    topK=50,
    alpha=0.5,
    enable_reranking=True,
    query_expansion_method="hyde",
    summarization_mode=SummarizationMode.ABSTRACTIVE,
    faithfulness_threshold=0.95
)

Pattern 3: Real-Time Chat

config = PipelineConfig(
    profile=ProfileType.LATENCY_FIRST,
    topK=10,
    alpha=0.3,  # Favor BM25 for speed (α: 0 = BM25 only, 1 = vector only)
    enable_reranking=False,
    enable_query_expansion=False,
    summarization_mode=SummarizationMode.EXTRACTIVE,
    max_summary_length=150
)

RAGAS Metrics

Metric              Formula                           Target (Balanced)   Target (Quality)
Context Precision   Relevant / Retrieved              > 0.70              > 0.85
Context Recall      Retrieved / Total Relevant        > 0.70              > 0.85
Faithfulness        Supported Claims / Total Claims   > 0.85              > 0.95
Answer Relevancy    cosine(query, answer)             > 0.75              > 0.80
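The first three formulas reduce to simple ratios over ID sets and claim counts; a toy sketch (the RAGAS library computes these with LLM judgments, not exact set math):

```python
def ragas_scores(retrieved, relevant, supported_claims, total_claims):
    """Toy versions of the table's ratio metrics.

    retrieved / relevant: iterables of document IDs.
    supported_claims / total_claims: claim counts from the summary.
    """
    hits = len(set(retrieved) & set(relevant))
    return {
        "context_precision": hits / len(retrieved),
        "context_recall": hits / len(relevant),
        "faithfulness": supported_claims / total_claims,
    }
```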

Common Tuning Parameters

Retrieval

topK = 20  # Number of results to retrieve
alpha = 0.5 # BM25 vs vector weight (0-1)

BM25

k1 = 1.2  # Term frequency saturation (0.5-3.0)
b = 0.75 # Length normalization (0-1)
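To see what k1 and b actually do, here is the standard per-term BM25 contribution (a reference sketch; the search engine computes idf and statistics internally):

```python
def bm25_term_score(tf, doc_len, avg_doc_len, idf, k1=1.2, b=0.75):
    """Standard per-term BM25 contribution.

    Higher k1 delays term-frequency saturation; higher b penalizes
    long documents more (b=0 disables length normalization).
    """
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + norm)
```

Note the saturation: with k1=1.2, a term occurring 100 times scores barely more than double a single occurrence.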

RRF

k = 60  # Rank constant (typically 60)

Reranking

model = "cross-encoder/ms-marco-MiniLM-L-6-v2"  # Balanced
model = "cross-encoder/ms-marco-MiniLM-L-12-v2" # Quality
top_k = 10 # Rerank top N candidates

Summarization

max_length = 250  # Words
faithfulness_threshold = 0.85 # Minimum acceptance

Caching Strategy

# L1: Full result cache (1 hour TTL)
cache_key = f"query:{hash(query)}:{hash(filters)}"

# L2: Retrieval cache (1 hour TTL)
cache_key = f"retrieval:{hash(query)}"

# L3: Summary cache (24 hour TTL)
cache_key = f"summary:{doc_ids_hash}"

# L4: Embedding cache (7 day TTL)
cache_key = f"embedding:{content_hash}"

Expected Savings: 40-60% cost reduction
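The four layers share one pattern: a stable key plus a per-layer TTL. A minimal in-process sketch (`make_key` and `TTLCache` are illustrative stand-ins for whatever cache backend, e.g. Redis, is actually used):

```python
import hashlib
import time

# Per-layer TTLs from the strategy above, in seconds
TTL_SECONDS = {"query": 3600, "retrieval": 3600, "summary": 86400, "embedding": 604800}

def make_key(layer, *parts):
    """Build a stable cache key like 'summary:<hex digest>' from hashable parts."""
    digest = hashlib.sha256("|".join(map(str, parts)).encode()).hexdigest()[:16]
    return f"{layer}:{digest}"

class TTLCache:
    """Minimal in-process TTL cache; `now` is injectable for testing."""
    def __init__(self):
        self._store = {}

    def put(self, key, value, ttl, now=None):
        now = time.time() if now is None else now
        self._store[key] = (value, now + ttl)

    def get(self, key, now=None):
        now = time.time() if now is None else now
        entry = self._store.get(key)
        if entry is None or entry[1] < now:
            return None  # Miss or expired
        return entry[0]
```

Hashing the parts (rather than concatenating raw queries) keeps keys bounded and avoids delimiter collisions.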


Error Handling

try:
    result = pipeline.execute(query)
except TimeoutError:
    # Fall back to the faster profile
    result = latency_pipeline.execute(query)
except Exception as e:
    # Log and return a generic error
    log_error(f"Pipeline failed: {e}")
    return ErrorResponse("Service unavailable")

Monitoring Metrics

# Latency
emit("search.latency_ms", timing["total_ms"])

# Quality
emit("search.faithfulness", summary.faithfulness)
emit("search.slo_met", slo_met)

# Usage
emit("search.queries_total", 1)
emit("search.cache_hit", cache_hit)

# Cost
emit("search.cost_usd", cost)

API Quick Reference

# Create pipeline
pipeline = DocumentSearchPipeline.create_profile(profile, store, embed_fn)

# Execute search
result = pipeline.execute(query, filters={...})

# Access results
result.query # Original query
result.results # List[RetrievalResult]
result.summary # GroundedSummary
result.timing # Dict[str, float]
result.slo_met # bool
result.facets # Dict (if enabled)

# Summary details
result.summary.text # Summary text
result.summary.citations # Dict[int, Citation]
result.summary.faithfulness # float
result.summary.coverage # float
result.summary.key_points # List[str]

# Citation details
citation = result.summary.citations[1]
citation.document_id # str
citation.document_title # str
citation.chunk_id # str
citation.snippet # str (100-150 chars)

Troubleshooting

Issue              Likely Cause               Solution
Slow searches      Profile misconfigured      Use latency-first, enable caching
Low faithfulness   Poor source quality        Use extractive mode, verify sources
Poor relevance     Wrong alpha weight         Tune α, enable query expansion
High costs         Too many LLM calls         Use extractive mode, add caching
SLO violations     Overly ambitious targets   Adjust profile or hardware

Performance Targets

Profile         Latency P95   Throughput   Concurrent Users
Balanced        < 500ms       100 QPS      1000+
Latency-First   < 250ms       200 QPS      2000+
Quality-First   < 5000ms      20 QPS       200+

Cost Breakdown (per 1K queries)

Component       Balanced   Latency-First   Quality-First
Retrieval       $0.40      $0.25           $1.50
Reranking       $0.20      $0.00           $0.50
Summarization   $0.00      $0.00           $50.00
Embedding       $0.10      $0.10           $0.50
Total           $0.60      $0.35           $52.50

Decision Tree

Start
  ↓
Need < 250ms latency?     → Yes → Latency-First
  ↓ No
Need 95%+ faithfulness?   → Yes → Quality-First
  ↓ No
General use case?         → Yes → Balanced
  ↓ No
Custom requirements       → Create custom profile
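The tree collapses to a tiny helper; the profile names below mirror `ProfileType`, but the function itself is an illustrative sketch, not part of the API:

```python
def pick_profile(needs_sub_250ms, needs_95_faithfulness, general_use):
    """Encode the decision tree above: checks run top to bottom,
    first match wins."""
    if needs_sub_250ms:
        return "LATENCY_FIRST"
    if needs_95_faithfulness:
        return "QUALITY_FIRST"
    if general_use:
        return "BALANCED"
    return "CUSTOM"
```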


Last Updated: October 9, 2025
Version: 1.0