
Document Search & Summarization - Refined Implementation Plan

Created: October 9, 2025
Status: Ready for Implementation
Version: 2.0 (Refined with Production Best Practices)


🎯 Key Decisions

✅ Storage: S3 for documents
✅ Default Summarization: Extractive (with grounded citations)
✅ Initial Formats: PDF, DOCX, XLSX (expand later)
✅ Processing Mode: Adaptive (sync < 10MB, async >= 10MB)
✅ Architecture: Profile-based pipeline (balanced, latency-first, quality-first)


📐 Architecture: Profile-Based Pipeline

Core Concept

Instead of one-size-fits-all, we define three profiles with different SLOs:

Pipeline = Retriever → Reranker → Summarizer
               ↓           ↓            ↓
           Profile-based configuration

Three Profiles

1. Balanced Profile (Default)

Use Case: General-purpose document Q&A

| Component | Configuration | Target |
|---|---|---|
| Retrieval | topK=20, hybrid (BM25 + vector, α=0.5) | < 200ms |
| Reranking | Light cross-encoder, topK=10 | < 100ms |
| Summarization | Extractive multi-chunk with citations | < 200ms |
| Total P95 Latency | | < 500ms |

Metrics Gates:

  • Context Precision > 0.7
  • Context Recall > 0.7
  • Faithfulness > 0.85
  • Response Relevancy > 0.75

2. Latency-First Profile

Use Case: Interactive search, auto-complete, real-time assistants

| Component | Configuration | Target |
|---|---|---|
| Retrieval | topK=10, vector-only or BM25-only | < 100ms |
| Reranking | Shallow or skip | < 50ms |
| Summarization | Single-chunk extractive, tight window | < 100ms |
| Total P95 Latency | | < 250ms |

Metrics Gates:

  • Response Relevancy > 0.7
  • Latency P95 < 250ms
  • Throughput > 200 QPS

3. Quality-First Profile

Use Case: Complex research, legal compliance, multi-document synthesis

| Component | Configuration | Target |
|---|---|---|
| Retrieval | topK=50, hybrid with query expansion | < 500ms |
| Reranking | Deep cross-encoder + LLM reranking | < 1000ms |
| Summarization | Multi-doc abstractive with strict citations | < 3000ms |
| Total P95 Latency | | < 5000ms |

Metrics Gates:

  • Context Precision > 0.85
  • Context Recall > 0.85
  • Faithfulness > 0.95
  • Citation Accuracy > 0.9

๐Ÿ—๏ธ Component Architectureโ€‹

1. Store Interface (Unified)

from abc import ABC, abstractmethod
from typing import Any, Dict, List


class DocumentStore(ABC):
    """Unified store interface for consistency across backends."""

    @abstractmethod
    def hybrid_search(
        self,
        query: str,
        query_embedding: List[float],
        filters: Dict[str, Any],
        topK: int,
        alpha: float,  # BM25 vs vector weight
    ) -> List[SearchResult]:
        """Hybrid BM25 + ANN search with score fusion."""
        pass

    @abstractmethod
    def get_facets(
        self,
        query: str,
        filters: Dict[str, Any],
    ) -> Dict[str, List[FacetValue]]:
        """Get faceted navigation options."""
        pass

Backends:

  • OpenSearch (provisioned) - Full control, advanced aggregations
  • OpenSearch Serverless - Low-ops, auto-scaling
  • MongoDB Atlas Vector Search - Alternative backend
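The interface above returns SearchResult and FacetValue objects that are not defined elsewhere in this plan. A minimal sketch of what they could look like (the field names are illustrative assumptions, not a fixed contract):

from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class SearchResult:
    """One hit returned by DocumentStore.hybrid_search (illustrative fields)."""
    chunk_id: str
    content: str
    score: float  # fused BM25 + vector score
    metadata: Dict[str, Any] = field(default_factory=dict)  # e.g. document_id, title


@dataclass
class FacetValue:
    """One bucket returned by DocumentStore.get_facets (illustrative fields)."""
    value: str   # e.g. "backend" for a `team` facet
    count: int   # number of matching documents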

2. Retriever (Hybrid)

from typing import Any, Dict, List, Optional


class HybridDocumentRetriever:
    """
    Hybrid BM25 + ANN retrieval with configurable fusion.

    Features:
    - Reciprocal Rank Fusion (RRF)
    - Query expansion (PRF, HyDE)
    - Intent-based gating
    - Faceted filtering
    """

    def __init__(
        self,
        store: DocumentStore,
        alpha: float = 0.5,  # BM25 vs vector weight
        query_expansion: Optional[QueryExpander] = None,
        filter_builder: Optional[FilterBuilder] = None,
    ):
        self.store = store
        self.alpha = alpha
        self.query_expansion = query_expansion
        self.filter_builder = filter_builder

    def retrieve(
        self,
        query: str,
        topK: int = 20,
        filters: Optional[Dict[str, Any]] = None,
        expand_query: bool = True,
    ) -> List[RetrievalResult]:
        """Execute hybrid retrieval with optional query expansion."""

        # Intent detection & gating
        intent = self._detect_intent(query)

        # Query expansion (PRF or HyDE), only when an expander is configured
        if expand_query and self.query_expansion and intent in ["complex", "research"]:
            expanded_queries = self.query_expansion.expand(query, method="hyde")
        else:
            expanded_queries = [query]

        # Build filters
        if self.filter_builder:
            filters = self.filter_builder.build(query, filters)

        # Execute hybrid search for each expanded query
        all_results = []
        for exp_query in expanded_queries:
            embedding = self._get_embedding(exp_query)
            results = self.store.hybrid_search(
                query=exp_query,
                query_embedding=embedding,
                filters=filters,
                topK=topK,
                alpha=self.alpha,
            )
            all_results.extend(results)

        # Reciprocal Rank Fusion
        fused_results = self._reciprocal_rank_fusion(all_results, k=60)

        return fused_results[:topK]

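The retriever calls _reciprocal_rank_fusion without showing it. Below is a minimal standalone sketch of standard RRF (each result contributes 1 / (k + rank) for every list it appears in); in the class above it would be fed the per-query ranked lists, or all_results regrouped per query, rather than a single flattened list:

from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(
    ranked_lists: List[List["SearchResult"]],
    k: int = 60,
) -> List["SearchResult"]:
    """Fuse several ranked result lists with the standard RRF formula.

    Results are deduplicated by chunk_id and returned best-first.
    """
    scores: Dict[str, float] = defaultdict(float)
    best: Dict[str, "SearchResult"] = {}

    for results in ranked_lists:
        for rank, result in enumerate(results, start=1):
            scores[result.chunk_id] += 1.0 / (k + rank)
            best.setdefault(result.chunk_id, result)

    fused_ids = sorted(scores, key=scores.get, reverse=True)
    return [best[cid] for cid in fused_ids]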

3. Grounded Summarizer (Citation-Aware)

from typing import Any, List, Optional


class GroundedSummarizer:
    """
    Grounded summarization with explicit citations.

    Key Features:
    - Only uses retrieved context (no hallucinations)
    - Extractive and abstractive modes
    - Citation tracking at sentence level
    - Coverage and faithfulness metrics
    """

    def __init__(
        self,
        mode: str = "extractive",        # extractive, abstractive
        citation_style: str = "inline",  # inline, footnote
        llm: Optional[Any] = None,       # LLM client, required for abstractive mode
    ):
        self.mode = mode
        self.citation_style = citation_style
        self.llm = llm

    def summarize(
        self,
        query: str,
        contexts: List[RetrievalResult],
        max_length: int = 250,
    ) -> GroundedSummary:
        """Generate grounded summary with citations."""

        if self.mode == "extractive":
            return self._extractive_summary(query, contexts, max_length)
        else:
            return self._abstractive_summary(query, contexts, max_length)

    def _extractive_summary(
        self,
        query: str,
        contexts: List[RetrievalResult],
        max_length: int,
    ) -> GroundedSummary:
        """Extract key sentences with citations."""

        # Score sentences by relevance to query
        sentences = []
        for ctx in contexts:
            for sent in self._split_sentences(ctx.content):
                score = self._score_sentence(sent, query)
                sentences.append({
                    "text": sent,
                    "score": score,
                    "source": ctx.metadata["document_id"],
                    "chunk_id": ctx.chunk_id,
                })

        # Sort and select top sentences
        sentences.sort(key=lambda x: x["score"], reverse=True)

        # Build summary with citations, numbering sources in first-use order
        summary_parts = []
        total_length = 0
        source_numbers = {}  # document_id -> citation number

        for sent_info in sentences:
            sent_length = len(sent_info["text"].split())
            if total_length + sent_length > max_length:
                break

            # Add citation (reuse the number if this source was already cited)
            src = sent_info["source"]
            if src not in source_numbers:
                source_numbers[src] = len(source_numbers) + 1
            citation = f'[{source_numbers[src]}]'
            summary_parts.append(f'{sent_info["text"]} {citation}')
            total_length += sent_length

        summary_text = " ".join(summary_parts)

        # Build citation map
        citations = {
            num: {
                "document_id": src,
                "document_title": self._get_document_title(src),
            }
            for src, num in source_numbers.items()
        }

        return GroundedSummary(
            text=summary_text,
            citations=citations,
            coverage=self._calculate_coverage(contexts, summary_parts),
            faithfulness=1.0,  # Extractive is always faithful
            mode="extractive",
        )

    def _abstractive_summary(
        self,
        query: str,
        contexts: List[RetrievalResult],
        max_length: int,
    ) -> GroundedSummary:
        """Generate abstractive summary with LLM + strict grounding."""

        # Build context with source IDs
        context_text = ""
        for idx, ctx in enumerate(contexts, 1):
            context_text += f"\n[Source {idx}]: {ctx.content}\n"

        # Prompt engineering for grounded generation
        prompt = f"""Based ONLY on the provided sources, answer this question: {query}

Sources:
{context_text}

Instructions:
1. Only use information from the sources above
2. Cite sources using [1], [2], etc. after each claim
3. If information is not in sources, say "Information not available"
4. Keep summary under {max_length} words
5. Be factual and specific

Answer:"""

        # Generate with LLM
        response = self.llm.invoke(prompt)
        summary_text = response.content

        # Extract citations from generated text
        citations = self._extract_citations(summary_text, contexts)

        # Verify faithfulness
        faithfulness = self._verify_faithfulness(summary_text, contexts)

        # Fail closed if faithfulness is low
        if faithfulness < 0.8:
            return self._extractive_summary(query, contexts, max_length)

        return GroundedSummary(
            text=summary_text,
            citations=citations,
            coverage=self._calculate_coverage(contexts, [summary_text]),
            faithfulness=faithfulness,
            mode="abstractive",
        )
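_verify_faithfulness is left abstract above. One simple, illustrative way to approximate it is sentence-level lexical overlap against the retrieved contexts; a production system would more likely use an NLI model or an LLM judge, so treat this purely as a sketch:

import re
from typing import List


def verify_faithfulness(summary: str, contexts: List[str], min_overlap: float = 0.6) -> float:
    """Fraction of summary sentences whose content words are covered by some context."""

    def words(text: str) -> set:
        return {w.lower() for w in re.findall(r"[a-zA-Z0-9]+", text) if len(w) > 3}

    context_word_sets = [words(c) for c in contexts]
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", summary) if s.strip()]
    if not sentences:
        return 0.0

    supported = 0
    for sent in sentences:
        sw = words(sent)
        if not sw:
            supported += 1  # nothing substantive to verify (e.g. a bare citation marker)
            continue
        best = max(len(sw & cw) / len(sw) for cw in context_word_sets) if context_word_sets else 0.0
        if best >= min_overlap:
            supported += 1

    return supported / len(sentences)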

4. Pipeline Abstraction

import time
from typing import Any, Dict, Optional


class DocumentSearchPipeline:
    """
    Configurable pipeline: Retriever → Reranker → Summarizer

    Profiles:
    - balanced: General-purpose, moderate quality & latency
    - latency_first: Interactive, fast response
    - quality_first: Research-grade, comprehensive
    """

    @classmethod
    def create_profile(cls, profile: str, store: DocumentStore) -> "DocumentSearchPipeline":
        """Factory method for profile-based pipelines."""

        if profile == "balanced":
            return cls(
                retriever=HybridDocumentRetriever(
                    store=store,
                    alpha=0.5,
                    query_expansion=QueryExpander(method="prf"),
                ),
                reranker=CrossEncoderReranker(
                    model="ms-marco-MiniLM-L-6-v2",
                    topK=10,
                ),
                summarizer=GroundedSummarizer(
                    mode="extractive",
                    citation_style="inline",
                ),
                config=PipelineConfig(
                    profile="balanced",
                    topK=20,
                    enable_reranking=True,
                    enable_query_expansion=True,
                    max_summary_length=250,
                    latency_budget_ms=500,
                ),
            )

        elif profile == "latency_first":
            return cls(
                retriever=HybridDocumentRetriever(
                    store=store,
                    alpha=0.7,             # Favor BM25 (faster)
                    query_expansion=None,  # Skip expansion
                ),
                reranker=None,             # Skip reranking
                summarizer=GroundedSummarizer(
                    mode="extractive",
                    citation_style="inline",
                ),
                config=PipelineConfig(
                    profile="latency_first",
                    topK=10,
                    enable_reranking=False,
                    enable_query_expansion=False,
                    max_summary_length=150,
                    latency_budget_ms=250,
                ),
            )

        elif profile == "quality_first":
            return cls(
                retriever=HybridDocumentRetriever(
                    store=store,
                    alpha=0.5,
                    query_expansion=QueryExpander(method="hyde"),
                ),
                reranker=LLMReranker(      # Deep reranking
                    model="gpt-3.5-turbo",
                    topK=20,
                ),
                summarizer=GroundedSummarizer(
                    mode="abstractive",    # requires an injected LLM client
                    citation_style="inline",
                ),
                config=PipelineConfig(
                    profile="quality_first",
                    topK=50,
                    enable_reranking=True,
                    enable_query_expansion=True,
                    max_summary_length=500,
                    latency_budget_ms=5000,
                ),
            )

        raise ValueError(f"Unknown profile: {profile}")

    def execute(
        self,
        query: str,
        filters: Optional[Dict[str, Any]] = None,
    ) -> PipelineResult:
        """Execute full pipeline with SLO tracking."""

        start_time = time.time()

        # 1. Retrieval
        retrieval_start = time.time()
        results = self.retriever.retrieve(
            query=query,
            topK=self.config.topK,
            filters=filters,
            expand_query=self.config.enable_query_expansion,
        )
        retrieval_time = (time.time() - retrieval_start) * 1000

        # 2. Reranking (optional)
        if self.config.enable_reranking and self.reranker:
            reranking_start = time.time()
            results = self.reranker.rerank(query, results)
            reranking_time = (time.time() - reranking_start) * 1000
        else:
            reranking_time = 0

        # 3. Summarization
        summarization_start = time.time()
        summary = self.summarizer.summarize(
            query=query,
            contexts=results[:5],  # Top 5 for summary
            max_length=self.config.max_summary_length,
        )
        summarization_time = (time.time() - summarization_start) * 1000

        total_time = (time.time() - start_time) * 1000

        # Check SLO compliance
        slo_met = total_time <= self.config.latency_budget_ms

        return PipelineResult(
            query=query,
            results=results,
            summary=summary,
            timing={
                "retrieval_ms": retrieval_time,
                "reranking_ms": reranking_time,
                "summarization_ms": summarization_time,
                "total_ms": total_time,
            },
            slo_met=slo_met,
            profile=self.config.profile,
        )
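For orientation, a minimal wiring sketch. PipelineConfig is shown as a plain dataclass matching the fields used above, and OpenSearchDocumentStore is a hypothetical DocumentStore implementation, not existing code:

from dataclasses import dataclass


@dataclass
class PipelineConfig:
    """Per-profile knobs used by create_profile and execute (illustrative)."""
    profile: str
    topK: int
    enable_reranking: bool
    enable_query_expansion: bool
    max_summary_length: int
    latency_budget_ms: int


# Hypothetical wiring; endpoint and index names are placeholders
store = OpenSearchDocumentStore(endpoint="https://search.internal", index="documents")
pipeline = DocumentSearchPipeline.create_profile("balanced", store=store)

result = pipeline.execute(
    query="What is our data retention policy for customer PII?",
    filters={"category": "policy"},
)
print(result.summary.text)        # grounded summary with [n] citations
print(result.timing["total_ms"], "ms; SLO met:", result.slo_met)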

📊 Evaluation Framework (RAGAS + Custom)

RAGAS Metrics as Release Gates

import numpy as np
from typing import Dict, List


class EvaluationGates:
    """
    RAGAS-based evaluation gates for CI/CD.

    Gates:
    - Context Precision: Relevant chunks in top-K
    - Context Recall: Coverage of ground truth
    - Faithfulness: No hallucinations
    - Response Relevancy: Answer matches query
    - Noise Sensitivity: Robust to irrelevant docs
    """

    PROFILE_THRESHOLDS = {
        "balanced": {
            "context_precision": 0.70,
            "context_recall": 0.70,
            "faithfulness": 0.85,
            "response_relevancy": 0.75,
            "noise_sensitivity": 0.80,
            "latency_p95_ms": 500,
        },
        "latency_first": {
            "response_relevancy": 0.70,
            "latency_p95_ms": 250,
            "throughput_qps": 200,
        },
        "quality_first": {
            "context_precision": 0.85,
            "context_recall": 0.85,
            "faithfulness": 0.95,
            "citation_accuracy": 0.90,
            "latency_p95_ms": 5000,
        },
    }

    def evaluate_profile(
        self,
        profile: str,
        pipeline: DocumentSearchPipeline,
        test_fixtures: List[TestCase],
    ) -> EvaluationReport:
        """
        Evaluate pipeline against profile thresholds.

        Returns:
        - pass/fail for each metric
        - detailed scores
        - recommendations
        """

        thresholds = self.PROFILE_THRESHOLDS[profile]

        # Run RAGAS evaluation
        results = []
        latencies = []

        for test_case in test_fixtures:
            result = pipeline.execute(
                query=test_case.query,
                filters=test_case.filters,
            )

            # Calculate RAGAS metrics
            metrics = self._calculate_ragas_metrics(
                query=test_case.query,
                contexts=result.results,
                answer=result.summary.text,
                ground_truth=test_case.ground_truth,
            )

            results.append(metrics)
            latencies.append(result.timing["total_ms"])

        # Aggregate metrics
        avg_metrics = self._aggregate_metrics(results)
        p95_latency = np.percentile(latencies, 95)

        # Check gates
        gates_passed = {}
        for metric, threshold in thresholds.items():
            if metric == "latency_p95_ms":
                gates_passed[metric] = p95_latency <= threshold
            else:
                gates_passed[metric] = avg_metrics.get(metric, 0) >= threshold

        all_passed = all(gates_passed.values())

        return EvaluationReport(
            profile=profile,
            metrics=avg_metrics,
            latency_p95=p95_latency,
            gates_passed=gates_passed,
            all_gates_passed=all_passed,
            recommendations=self._generate_recommendations(avg_metrics, thresholds),
        )

    def _calculate_ragas_metrics(
        self,
        query: str,
        contexts: List[RetrievalResult],
        answer: str,
        ground_truth: str,
    ) -> Dict[str, float]:
        """Calculate RAGAS metrics for a single query."""

        from datasets import Dataset
        from ragas import evaluate
        from ragas.metrics import (
            context_precision,
            context_recall,
            faithfulness,
            answer_relevancy,
        )

        # Prepare data for RAGAS (evaluate expects a datasets.Dataset)
        data = Dataset.from_dict({
            "question": [query],
            "contexts": [[ctx.content for ctx in contexts]],
            "answer": [answer],
            "ground_truth": [ground_truth],
        })

        # Evaluate
        result = evaluate(
            dataset=data,
            metrics=[
                context_precision,
                context_recall,
                faithfulness,
                answer_relevancy,
            ],
        )

        return {
            "context_precision": result["context_precision"],
            "context_recall": result["context_recall"],
            "faithfulness": result["faithfulness"],
            "response_relevancy": result["answer_relevancy"],
        }
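A sketch of how these gates could run as a CI job with pytest. The store fixture and the load_test_fixtures helper are assumptions of this sketch, not existing code:

import pytest

PROFILES = ["balanced", "latency_first", "quality_first"]


@pytest.mark.parametrize("profile", PROFILES)
def test_profile_meets_release_gates(profile, store):
    """Fail the build if any RAGAS or latency gate for the profile is not met."""
    pipeline = DocumentSearchPipeline.create_profile(profile, store=store)
    fixtures = load_test_fixtures(profile)  # hypothetical helper; see Test Fixtures section
    report = EvaluationGates().evaluate_profile(profile, pipeline, fixtures)

    failed = [metric for metric, ok in report.gates_passed.items() if not ok]
    assert report.all_gates_passed, f"{profile} gates failed: {failed} (scores: {report.metrics})"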

🎭 User Stories & Business Framing

Story 1: Knowledge Assistant (Support Tickets)

Scenario: Support engineer needs to resolve customer ticket using KB articles

Profile: Balanced

Flow:

  1. Engineer enters customer question
  2. System searches KB with hybrid retrieval
  3. Returns top 5 relevant articles with citations
  4. Generates grounded summary with inline citations
  5. Engineer verifies citations and responds to customer

Success Metrics:

  • Time to resolution: < 5 minutes (down from 15 minutes)
  • Citation accuracy: > 90%
  • Customer satisfaction: > 4.5/5

Evaluation:

  • Faithfulness > 0.85 (no hallucinations)
  • Coverage > 0.7 (answer addresses query)
  • Latency < 500ms (interactive experience)

Story 2: Policy/Compliance QA (Audit-Ready)

Scenario: Compliance officer verifies policy adherence

Profile: Quality-First

Flow:

  1. Officer asks complex compliance question
  2. System searches only approved policy corpus
  3. Deep reranking ensures highest relevance
  4. Multi-document synthesis with strict citations
  5. Fail closed if coverage < 80%
  6. Log decision for audit trail

Success Metrics:

  • Citation accuracy: > 95%
  • Zero unapproved sources used
  • Audit log completeness: 100%

Evaluation:

  • Faithfulness > 0.95 (strict)
  • Context precision > 0.85
  • Coverage > 0.8 (or fail closed)

Story 3: Engineering Docs Search (Team Updates)

Scenario: Engineering manager prepares quarterly update using design docs

Profile: Quality-First (with faceted filters)

Flow:

  1. Manager searches across design docs by team and quarter
  2. Apply facets: team=backend, quarter=Q3, status=approved
  3. System retrieves and synthesizes key decisions
  4. Generates multi-document summary with timeline
  5. Manager reviews and shares with leadership

Success Metrics:

  • Time to prepare: < 30 minutes (down from 4 hours)
  • Comprehensive coverage: > 90% of key decisions
  • Leadership satisfaction: > 4.7/5

Evaluation:

  • Context recall > 0.85 (comprehensive)
  • Multi-doc coherence > 0.8
  • Citation accuracy > 0.9

🚀 Implementation Roadmap

Phase 0: Foundation (Week 1)

Goal: Lock profiles, stub interfaces, run minimal balanced flow

Tasks:

  1. ✅ Define three profile configurations
  2. ⬜ Implement DocumentStore interface
  3. ⬜ Implement HybridDocumentRetriever (basic)
  4. ⬜ Implement GroundedSummarizer (extractive only)
  5. ⬜ Implement DocumentSearchPipeline with balanced profile
  6. ⬜ Create 10-query test fixture
  7. ⬜ Run baseline evaluation

Deliverable: End-to-end balanced flow with baseline metrics

Success Criteria:

  • Pipeline executes without errors
  • Baseline metrics recorded
  • P95 latency measured

Phase 1: Document Loading (Week 2)

Goal: Load PDF, DOCX, XLSX with metadata extraction

Tasks:

  1. ⬜ Implement PDFLoader (pypdf; see the sketch after this list)
  2. ⬜ Implement DOCXLoader (python-docx)
  3. ⬜ Implement ExcelLoader (openpyxl)
  4. ⬜ Implement DocumentLoaderFactory
  5. ⬜ Add S3 upload/download integration
  6. ⬜ Write loader tests (each format)
  7. ⬜ Integration with existing chunker
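A minimal sketch of the PDF loader from task 1, built on pypdf's PdfReader. It returns a plain dict here rather than the plan's eventual document type, which is still to be defined:

from pypdf import PdfReader


class PDFLoader:
    """Sketch of a PDF loader: per-page text extraction plus basic metadata."""

    def load(self, path: str) -> dict:
        reader = PdfReader(path)
        pages = [page.extract_text() or "" for page in reader.pages]
        meta = reader.metadata  # may be None for PDFs without an info dictionary
        return {
            "content": "\n\n".join(pages),
            "metadata": {
                "title": meta.title if meta else None,
                "author": meta.author if meta else None,
                "page_count": len(reader.pages),
                "format": "pdf",
            },
        }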

Deliverable: Universal document loader for 3 formats

Success Criteria:

  • All 3 formats load correctly
  • Metadata extracted accurately
  • Test coverage > 85%

Phase 2: Search Enhancement (Week 3)

Goal: Complete hybrid retrieval with query expansion and facets

Tasks:

  1. ⬜ Implement query expansion (PRF and HyDE; see the sketch after this list)
  2. ⬜ Add intent detection and gating
  3. ⬜ Implement faceted filtering
  4. ⬜ Add RRF score fusion
  5. ⬜ Optimize OpenSearch queries
  6. ⬜ Add result caching (Redis)
  7. ⬜ Performance testing
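A sketch of HyDE-style expansion from task 1: ask an LLM for a short hypothetical answer and retrieve with that text alongside the original query. The llm callable is an assumption; only the HyDE path is shown, PRF would slot in beside it:

from typing import Callable, List, Optional


class QueryExpander:
    """Sketch of query expansion; only the HyDE path is implemented here."""

    def __init__(self, method: str = "hyde", llm: Optional[Callable[[str], str]] = None):
        self.method = method
        self.llm = llm  # any text-in/text-out callable; assumed, not specified by the plan

    def expand(self, query: str, method: Optional[str] = None) -> List[str]:
        method = method or self.method
        if method != "hyde" or self.llm is None:
            return [query]  # PRF or other strategies would plug in here

        # HyDE: embed a hypothetical answer rather than the raw query
        prompt = (
            "Write a short, factual paragraph that directly answers this question:\n"
            f"{query}"
        )
        hypothetical_doc = self.llm(prompt)
        # Both strings are searched and then fused downstream (e.g. via RRF)
        return [query, hypothetical_doc]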

Deliverable: Production-grade hybrid retrieval

Success Criteria:

  • Context precision > 0.7
  • P95 latency < 200ms
  • Cache hit rate > 40%

Phase 3: Grounded Summarization (Week 4)

Goal: Extractive + abstractive summarization with citations

Tasks:

  1. ⬜ Complete extractive summarizer (TextRank)
  2. ⬜ Implement abstractive summarizer (GPT-3.5)
  3. ⬜ Add citation extraction and tracking (see the sketch after this list)
  4. ⬜ Implement faithfulness verification
  5. ⬜ Add fail-closed logic for low faithfulness
  6. ⬜ Multi-document summarization
  7. ⬜ Summarization tests
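A small sketch of the citation extraction in task 3: pull the [n] markers the LLM emitted and map them back to the retrieved contexts, following the [Source n] numbering used in the grounded prompt above:

import re
from typing import Dict, List


def extract_citations(summary_text: str, contexts: List["RetrievalResult"]) -> Dict[int, dict]:
    """Map [n] markers in the generated summary back to the n-th retrieved context."""
    cited = sorted({int(m) for m in re.findall(r"\[(\d+)\]", summary_text)})
    citations = {}
    for n in cited:
        if 1 <= n <= len(contexts):
            ctx = contexts[n - 1]
            citations[n] = {
                "document_id": ctx.metadata.get("document_id"),
                "chunk_id": ctx.chunk_id,
            }
    return citations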

Deliverable: Grounded summarization with citations

Success Criteria:

  • Faithfulness > 0.85
  • Citation accuracy > 0.9
  • No hallucinations detected

Phase 4: Profile Implementation (Week 5)

Goal: Implement all three profiles with SLO tracking

Tasks:

  1. ⬜ Implement latency-first profile
  2. ⬜ Implement quality-first profile
  3. ⬜ Add SLO tracking and alerting
  4. ⬜ Profile-specific optimizations
  5. ⬜ A/B testing framework
  6. ⬜ Profile comparison tool

Deliverable: Three production-ready profiles

Success Criteria:

  • All profiles meet SLOs
  • Evaluation gates pass
  • Profile selection logic works

Phase 5: Evaluation Harness (Week 6)

Goal: RAGAS-based CI/CD gates

Tasks:

  1. ⬜ Implement RAGAS evaluation wrapper
  2. ⬜ Create tenant test fixtures (50+ queries)
  3. ⬜ Add custom metrics (citation accuracy, coverage)
  4. ⬜ Integrate with CI/CD pipeline
  5. ⬜ Set up regression detection
  6. ⬜ Create evaluation dashboard

Deliverable: Automated evaluation in CI

Success Criteria:

  • RAGAS metrics calculated correctly
  • CI gates block failing PRs
  • Regression detection works

Phase 6: API & Documentation (Week 7)

Goal: REST API with Docusaurus docs

Tasks:

  1. ⬜ Implement FastAPI endpoints (7 endpoints)
  2. ⬜ Add authentication and rate limiting
  3. ⬜ OpenAPI/Swagger documentation
  4. ⬜ Docusaurus setup with clear IA
  5. ⬜ Tutorial for each profile
  6. ⬜ User story examples
  7. ⬜ Algolia DocSearch integration

Deliverable: Complete API with docs

Success Criteria:

  • All endpoints documented
  • Tutorials runnable
  • Search works in docs

Phase 7: Production Deployment (Week 8)

Goal: Deploy with monitoring and alerting

Tasks:

  1. ⬜ OpenSearch provisioned setup (or Serverless)
  2. ⬜ S3 bucket configuration
  3. ⬜ MongoDB Atlas setup
  4. ⬜ Redis cache setup
  5. ⬜ Prometheus + Grafana dashboards
  6. ⬜ Alert rules and runbooks
  7. ⬜ Load testing (1000+ QPS)
  8. ⬜ Security audit

Deliverable: Production-ready deployment

Success Criteria:

  • 99.9% uptime
  • All SLOs met
  • Alerts working
  • Security approved

📊 Monitoring & SLOs

Profile-Based SLOs

| Metric | Balanced | Latency-First | Quality-First |
|---|---|---|---|
| P95 Latency | < 500ms | < 250ms | < 5000ms |
| Context Precision | > 0.70 | - | > 0.85 |
| Context Recall | > 0.70 | - | > 0.85 |
| Faithfulness | > 0.85 | - | > 0.95 |
| Response Relevancy | > 0.75 | > 0.70 | > 0.80 |
| Throughput | 100 QPS | 200 QPS | 20 QPS |
| Availability | 99.9% | 99.9% | 99.5% |

Dashboards

  1. System Health: Latency, error rate, throughput
  2. Quality Metrics: RAGAS scores over time
  3. Profile Performance: Per-profile SLO tracking
  4. Cost Tracking: LLM costs, storage, compute
  5. User Stories: Metrics per user story
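A sketch of the per-profile SLO instrumentation these dashboards could be built on, using the Prometheus Python client; the metric names are illustrative assumptions:

from prometheus_client import Counter, Histogram

# Label by profile so each SLO can be tracked separately on the dashboards
PIPELINE_LATENCY = Histogram(
    "docsearch_pipeline_latency_seconds",
    "End-to-end and per-stage pipeline latency",
    labelnames=["profile", "stage"],
)
SLO_VIOLATIONS = Counter(
    "docsearch_slo_violations_total",
    "Requests that exceeded the profile latency budget",
    labelnames=["profile"],
)


def record_pipeline_result(result: "PipelineResult") -> None:
    """Push one PipelineResult's timings into Prometheus."""
    for stage in ("retrieval", "reranking", "summarization", "total"):
        PIPELINE_LATENCY.labels(result.profile, stage).observe(
            result.timing[f"{stage}_ms"] / 1000.0
        )
    if not result.slo_met:
        SLO_VIOLATIONS.labels(result.profile).inc()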

🧪 Test Fixtures

Tenant Fixture Structure

{
  "fixture_name": "support_kb_fixture",
  "description": "Customer support knowledge base queries",
  "profile": "balanced",
  "test_cases": [
    {
      "query": "How do I reset my password?",
      "filters": {"category": "account"},
      "ground_truth": "Navigate to Settings > Security > Reset Password...",
      "expected_sources": ["doc_123", "doc_456"],
      "expected_latency_ms": 400,
      "slo_requirements": {
        "context_precision": 0.75,
        "faithfulness": 0.90
      }
    }
  ]
}
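A minimal sketch of loading such a fixture file into the TestCase objects consumed by EvaluationGates. The dataclass shape mirrors the JSON above and is an assumption of this plan, not an existing type:

import json
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class TestCase:
    """One evaluation query from a tenant fixture (illustrative shape)."""
    query: str
    ground_truth: str
    filters: Optional[Dict[str, Any]] = None
    expected_sources: List[str] = field(default_factory=list)
    expected_latency_ms: Optional[int] = None
    slo_requirements: Dict[str, float] = field(default_factory=dict)


def load_fixture(path: str) -> List[TestCase]:
    """Read a fixture JSON file and return its test cases."""
    with open(path) as f:
        fixture = json.load(f)
    return [TestCase(**case) for case in fixture["test_cases"]]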

Initial Test Fixtures (Week 1)

  1. Support KB Fixture - 10 queries
  2. Policy Compliance Fixture - 10 queries
  3. Engineering Docs Fixture - 10 queries

Total: 30 queries for baseline

Expanded Fixtures (Week 6)

  • 50+ queries per fixture
  • Edge cases and adversarial queries
  • Multi-language support (future)
  • Domain-specific fixtures

📖 Documentation Structure (Docusaurus)

Information Architecture

docs/
├── getting-started/
│   ├── introduction.md
│   ├── quick-start.md
│   └── installation.md
├── tutorials/
│   ├── balanced-profile-tutorial.md
│   ├── latency-first-tutorial.md
│   ├── quality-first-tutorial.md
│   └── custom-profile-tutorial.md
├── user-stories/
│   ├── knowledge-assistant.md
│   ├── policy-compliance.md
│   └── engineering-docs.md
├── api-reference/
│   ├── overview.md
│   ├── upload-api.md
│   ├── search-api.md
│   └── summarization-api.md
├── guides/
│   ├── profile-selection.md
│   ├── query-optimization.md
│   ├── citation-handling.md
│   └── evaluation-metrics.md
├── deployment/
│   ├── opensearch-provisioned.md
│   ├── opensearch-serverless.md
│   └── production-checklist.md
└── troubleshooting/
    ├── latency-issues.md
    ├── quality-issues.md
    └── common-errors.md

Search Integration

Option 1: Algolia DocSearch (Recommended)

  • Contextual search in docs
  • Instant results
  • Analytics

Option 2: Local search plugin

  • For air-gapped deployments
  • Self-hosted
  • Privacy-focused

💰 Refined Cost Analysis

Profile-Based Costs (per 1000 queries)

| Profile | Compute | LLM API | Storage I/O | Total |
|---|---|---|---|---|
| Balanced | $0.50 | $0.00 | $0.10 | $0.60 |
| Latency-First | $0.30 | $0.00 | $0.05 | $0.35 |
| Quality-First | $2.00 | $50.00 | $0.50 | $52.50 |

Notes:

  • Balanced: Extractive only (no LLM costs)
  • Quality-First: Abstractive with GPT-3.5 ($0.05 per summary)
  • Caching reduces costs by 40-60%
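Because caching drives much of that saving, here is a sketch of the Redis result cache mentioned in Phase 2, keyed on a hash of profile, query, and filters. The key format and TTL are assumptions to be tuned per profile:

import hashlib
import json
from typing import Any, Dict, Optional

import redis

cache = redis.Redis(host="localhost", port=6379, db=0)
CACHE_TTL_SECONDS = 15 * 60  # assumed TTL


def cache_key(profile: str, query: str, filters: Optional[Dict[str, Any]]) -> str:
    payload = json.dumps({"p": profile, "q": query, "f": filters or {}}, sort_keys=True)
    return "docsearch:" + hashlib.sha256(payload.encode()).hexdigest()


def cached_execute(pipeline: "DocumentSearchPipeline", query: str,
                   filters: Optional[Dict[str, Any]] = None) -> dict:
    """Return a cached response when available, otherwise run the pipeline and cache it."""
    key = cache_key(pipeline.config.profile, query, filters)
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)

    result = pipeline.execute(query, filters=filters)
    response = {"summary": result.summary.text, "timing": result.timing}  # serializable subset
    cache.setex(key, CACHE_TTL_SECONDS, json.dumps(response))
    return response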

Monthly Infrastructure (10,000 users)

| Component | Size | Cost |
|---|---|---|
| OpenSearch (provisioned) | 3x m5.xlarge.search | $350 |
| S3 | 5TB docs | $115 |
| MongoDB Atlas | M30 | $210 |
| Redis | cache.m5.large | $150 |
| Compute (ECS) | 10x t3.large | $730 |
| Total | | $1,555/month |

Cost per user: $0.16/month


🎯 Immediate Next Steps

This Week (Week 0)

  1. ✅ Lock profile definitions
  2. ⬜ Stub DocumentStore, HybridRetriever, GroundedSummarizer interfaces
  3. ⬜ Implement minimal balanced profile
  4. ⬜ Create 10-query test fixture
  5. ⬜ Run baseline evaluation
  6. ⬜ Measure baseline latency and metrics

Goal: End-to-end flow by Friday

Next Week (Week 1)

  1. ⬜ Implement document loaders (PDF, DOCX, XLSX)
  2. ⬜ S3 integration
  3. ⬜ Expand test fixtures to 30 queries
  4. ⬜ Optimize baseline metrics

Goal: Full document loading pipeline


📞 Key Contacts & Review

Technical Review: Backend team, ML team
Business Review: Product manager
Security Review: Security engineer
Final Approval: CTO

Next Review: End of Week 1 (baseline metrics)


This refined plan incorporates the production best practices above while staying focused on delivering value through user stories. Ready to start implementation! 🚀