Document Search & Summarization - Refined Implementation Plan
Created: October 9, 2025
Status: Ready for Implementation
Version: 2.0 (Refined with Production Best Practices)
Key Decisions

- ✅ Storage: S3 for documents
- ✅ Default Summarization: Extractive (with grounded citations)
- ✅ Initial Formats: PDF, DOCX, XLSX (expand later)
- ✅ Processing Mode: Adaptive (sync < 10MB, async >= 10MB)
- ✅ Architecture: Profile-based pipeline (balanced, latency-first, quality-first)
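The adaptive processing decision can be sketched as a simple size-based router. This is a minimal illustration of the rule above, not prescribed implementation; the function name and the queueing details are assumptions.

```python
SYNC_SIZE_LIMIT_BYTES = 10 * 1024 * 1024  # 10 MB threshold from the decisions above

def choose_processing_mode(size_bytes: int) -> str:
    """Route small uploads to synchronous processing, large ones to an async queue."""
    return "sync" if size_bytes < SYNC_SIZE_LIMIT_BYTES else "async"
```

In practice, the "async" branch would enqueue an S3 object reference (e.g. on SQS) and return a job ID, while the "sync" branch processes the document inline in the upload request.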
Architecture: Profile-Based Pipeline

Core Concept

Instead of one-size-fits-all, we define three profiles with different SLOs:

```
Pipeline = Retriever → Reranker → Summarizer
               ↓           ↓          ↓
          Profile-based configuration
```
Three Profiles

1. Balanced Profile (Default)

Use Case: General-purpose document Q&A

| Component | Configuration | Target |
|---|---|---|
| Retrieval | topK=20, hybrid (BM25 + vector, α=0.5) | < 200ms |
| Reranking | Light cross-encoder, topK=10 | < 100ms |
| Summarization | Extractive multi-chunk with citations | < 200ms |
| Total P95 Latency | | < 500ms |
Metrics Gates:
- Context Precision > 0.7
- Context Recall > 0.7
- Faithfulness > 0.85
- Response Relevancy > 0.75
2. Latency-First Profile
Use Case: Interactive search, auto-complete, real-time assistants
| Component | Configuration | Target |
|---|---|---|
| Retrieval | topK=10, vector-only or BM25-only | < 100ms |
| Reranking | Shallow or skip | < 50ms |
| Summarization | Single-chunk extractive, tight window | < 100ms |
| Total P95 Latency | | < 250ms |
Metrics Gates:
- Response Relevancy > 0.7
- Latency P95 < 250ms
- Throughput > 200 QPS
3. Quality-First Profile
Use Case: Complex research, legal compliance, multi-document synthesis
| Component | Configuration | Target |
|---|---|---|
| Retrieval | topK=50, hybrid with query expansion | < 500ms |
| Reranking | Deep cross-encoder + LLM reranking | < 1000ms |
| Summarization | Multi-doc abstractive with strict citations | < 3000ms |
| Total P95 Latency | | < 5000ms |
Metrics Gates:
- Context Precision > 0.85
- Context Recall > 0.85
- Faithfulness > 0.95
- Citation Accuracy > 0.9
Component Architecture

1. Store Interface (Unified)
```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List

class DocumentStore(ABC):
    """Unified store interface for consistency across backends."""

    @abstractmethod
    def hybrid_search(
        self,
        query: str,
        query_embedding: List[float],
        filters: Dict[str, Any],
        topK: int,
        alpha: float,  # BM25 vs. vector weight
    ) -> List["SearchResult"]:
        """Hybrid BM25 + ANN search with score fusion."""
        ...

    @abstractmethod
    def get_facets(
        self,
        query: str,
        filters: Dict[str, Any],
    ) -> Dict[str, List["FacetValue"]]:
        """Get faceted navigation options."""
        ...
```
Backends:
- OpenSearch (provisioned) - Full control, advanced aggregations
- OpenSearch Serverless - Low-ops, auto-scaling
- MongoDB Atlas Vector Search - Alternative backend
2. Retriever (Hybrid)
```python
class HybridDocumentRetriever:
    """
    Hybrid BM25 + ANN retrieval with configurable fusion.

    Features:
    - Reciprocal Rank Fusion (RRF)
    - Query expansion (PRF, HyDE)
    - Intent-based gating
    - Faceted filtering
    """

    def __init__(
        self,
        store: DocumentStore,
        alpha: float = 0.5,  # BM25 vs. vector weight
        query_expansion: Optional[QueryExpander] = None,
        filter_builder: Optional[FilterBuilder] = None,
    ):
        self.store = store
        self.alpha = alpha
        self.query_expansion = query_expansion
        self.filter_builder = filter_builder

    def retrieve(
        self,
        query: str,
        topK: int = 20,
        filters: Optional[Dict[str, Any]] = None,
        expand_query: bool = True,
    ) -> List[RetrievalResult]:
        """Execute hybrid retrieval with optional query expansion."""
        # Intent detection & gating
        intent = self._detect_intent(query)

        # Query expansion (PRF or HyDE), only when an expander is configured
        if expand_query and self.query_expansion and intent in ("complex", "research"):
            expanded_queries = self.query_expansion.expand(query)
        else:
            expanded_queries = [query]

        # Build filters
        if self.filter_builder:
            filters = self.filter_builder.build(query, filters)

        # Execute hybrid search for each expanded query
        all_results = []
        for exp_query in expanded_queries:
            embedding = self._get_embedding(exp_query)
            results = self.store.hybrid_search(
                query=exp_query,
                query_embedding=embedding,
                filters=filters,
                topK=topK,
                alpha=self.alpha,
            )
            all_results.extend(results)

        # Reciprocal Rank Fusion across the result lists
        fused_results = self._reciprocal_rank_fusion(all_results, k=60)
        return fused_results[:topK]
```
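The `_reciprocal_rank_fusion` helper referenced above is not defined in this plan. A minimal standalone version, operating on ranked lists of document IDs (one list per expanded query, a simplification of the retriever's `SearchResult` objects), might look like:

```python
from collections import defaultdict
from typing import Dict, List

def reciprocal_rank_fusion(result_lists: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in;
    k=60 is the conventional default and matches the call above.
    """
    scores: Dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.__getitem__, reverse=True)
```

Documents that appear high in several lists dominate, which is why RRF is a robust default for combining BM25 and vector rankings without score normalization.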
3. Grounded Summarizer (Citation-Aware)
```python
class GroundedSummarizer:
    """
    Grounded summarization with explicit citations.

    Key Features:
    - Only uses retrieved context (no hallucinations)
    - Extractive and abstractive modes
    - Citation tracking at sentence level
    - Coverage and faithfulness metrics
    """

    def __init__(
        self,
        mode: str = "extractive",        # extractive | abstractive
        citation_style: str = "inline",  # inline | footnote
        llm=None,                        # LLM client; required for abstractive mode
    ):
        self.mode = mode
        self.citation_style = citation_style
        self.llm = llm

    def summarize(
        self,
        query: str,
        contexts: List[RetrievalResult],
        max_length: int = 250,
    ) -> GroundedSummary:
        """Generate grounded summary with citations."""
        if self.mode == "extractive":
            return self._extractive_summary(query, contexts, max_length)
        return self._abstractive_summary(query, contexts, max_length)

    def _extractive_summary(
        self,
        query: str,
        contexts: List[RetrievalResult],
        max_length: int,
    ) -> GroundedSummary:
        """Extract key sentences with citations."""
        # Score sentences by relevance to query
        sentences = []
        for ctx in contexts:
            for sent in self._split_sentences(ctx.content):
                sentences.append({
                    "text": sent,
                    "score": self._score_sentence(sent, query),
                    "source": ctx.metadata["document_id"],
                    "chunk_id": ctx.chunk_id,
                })

        # Sort and select top sentences
        sentences.sort(key=lambda x: x["score"], reverse=True)

        # Build summary with citations (one citation number per distinct source)
        summary_parts = []
        total_length = 0
        source_index: Dict[str, int] = {}  # document_id -> citation number

        for sent_info in sentences:
            sent_length = len(sent_info["text"].split())
            if total_length + sent_length > max_length:
                break
            src = sent_info["source"]
            if src not in source_index:
                source_index[src] = len(source_index) + 1
            summary_parts.append(f'{sent_info["text"]} [{source_index[src]}]')
            total_length += sent_length

        summary_text = " ".join(summary_parts)

        # Build citation map
        citations = {
            idx: {
                "document_id": src,
                "document_title": self._get_document_title(src),
            }
            for src, idx in source_index.items()
        }

        return GroundedSummary(
            text=summary_text,
            citations=citations,
            coverage=self._calculate_coverage(contexts, summary_parts),
            faithfulness=1.0,  # Extractive output is grounded by construction
            mode="extractive",
        )

    def _abstractive_summary(
        self,
        query: str,
        contexts: List[RetrievalResult],
        max_length: int,
    ) -> GroundedSummary:
        """Generate abstractive summary with LLM + strict grounding."""
        # Build context with source IDs
        context_text = ""
        for idx, ctx in enumerate(contexts, 1):
            context_text += f"\n[Source {idx}]: {ctx.content}\n"

        # Prompt engineering for grounded generation
        prompt = f"""Based ONLY on the provided sources, answer this question: {query}

Sources:
{context_text}

Instructions:
1. Only use information from the sources above
2. Cite sources using [1], [2], etc. after each claim
3. If information is not in sources, say "Information not available"
4. Keep summary under {max_length} words
5. Be factual and specific

Answer:"""

        # Generate with LLM
        response = self.llm.invoke(prompt)
        summary_text = response.content

        # Extract citations from generated text
        citations = self._extract_citations(summary_text, contexts)

        # Verify faithfulness; fail closed to extractive mode if it is low
        faithfulness = self._verify_faithfulness(summary_text, contexts)
        if faithfulness < 0.8:
            return self._extractive_summary(query, contexts, max_length)

        return GroundedSummary(
            text=summary_text,
            citations=citations,
            coverage=self._calculate_coverage(contexts, [summary_text]),
            faithfulness=faithfulness,
            mode="abstractive",
        )
```
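`_verify_faithfulness` is left undefined above. As a crude, LLM-free stand-in (an assumption for illustration, not the plan's prescribed method), token overlap between the summary and the retrieved contexts gives a cheap lower-bound signal:

```python
import re
from typing import List

def token_overlap_faithfulness(summary: str, contexts: List[str]) -> float:
    """Fraction of summary tokens that also appear somewhere in the contexts.

    A cheap proxy for faithfulness: 1.0 means every summary token is grounded.
    A real deployment would use an NLI model or the RAGAS faithfulness metric.
    """
    def tokenize(text: str) -> set:
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    summary_tokens = tokenize(summary)
    if not summary_tokens:
        return 1.0  # an empty summary makes no unsupported claims
    context_tokens = set()
    for ctx in contexts:
        context_tokens |= tokenize(ctx)
    return len(summary_tokens & context_tokens) / len(summary_tokens)
```

This catches blatant hallucinations (entities never seen in any context) but not subtle ones (supported words recombined into unsupported claims), which is why the plan also gates on RAGAS faithfulness.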
4. Pipeline Abstraction
```python
import time

class DocumentSearchPipeline:
    """
    Configurable pipeline: Retriever → Reranker → Summarizer

    Profiles:
    - balanced: General-purpose, moderate quality & latency
    - latency_first: Interactive, fast response
    - quality_first: Research-grade, comprehensive
    """

    def __init__(self, retriever, reranker, summarizer, config):
        self.retriever = retriever
        self.reranker = reranker
        self.summarizer = summarizer
        self.config = config

    @classmethod
    def create_profile(cls, profile: str, store: DocumentStore) -> "DocumentSearchPipeline":
        """Factory method for profile-based pipelines."""
        if profile == "balanced":
            return cls(
                retriever=HybridDocumentRetriever(
                    store=store,
                    alpha=0.5,
                    query_expansion=QueryExpander(method="prf"),
                ),
                reranker=CrossEncoderReranker(
                    model="ms-marco-MiniLM-L-6-v2",
                    topK=10,
                ),
                summarizer=GroundedSummarizer(
                    mode="extractive",
                    citation_style="inline",
                ),
                config=PipelineConfig(
                    topK=20,
                    enable_reranking=True,
                    enable_query_expansion=True,
                    max_summary_length=250,
                    latency_budget_ms=500,
                    profile="balanced",
                ),
            )
        elif profile == "latency_first":
            return cls(
                retriever=HybridDocumentRetriever(
                    store=store,
                    alpha=0.7,  # Favor BM25 (faster)
                    query_expansion=None,  # Skip expansion
                ),
                reranker=None,  # Skip reranking
                summarizer=GroundedSummarizer(
                    mode="extractive",
                    citation_style="inline",
                ),
                config=PipelineConfig(
                    topK=10,
                    enable_reranking=False,
                    enable_query_expansion=False,
                    max_summary_length=150,
                    latency_budget_ms=250,
                    profile="latency_first",
                ),
            )
        elif profile == "quality_first":
            return cls(
                retriever=HybridDocumentRetriever(
                    store=store,
                    alpha=0.5,
                    query_expansion=QueryExpander(method="hyde"),
                ),
                reranker=LLMReranker(  # Deep reranking
                    model="gpt-3.5-turbo",
                    topK=20,
                ),
                summarizer=GroundedSummarizer(
                    mode="abstractive",
                    citation_style="inline",
                ),
                config=PipelineConfig(
                    topK=50,
                    enable_reranking=True,
                    enable_query_expansion=True,
                    max_summary_length=500,
                    latency_budget_ms=5000,
                    profile="quality_first",
                ),
            )
        raise ValueError(f"Unknown profile: {profile!r}")

    def execute(
        self,
        query: str,
        filters: Optional[Dict[str, Any]] = None,
    ) -> PipelineResult:
        """Execute full pipeline with SLO tracking."""
        start_time = time.time()

        # 1. Retrieval
        retrieval_start = time.time()
        results = self.retriever.retrieve(
            query=query,
            topK=self.config.topK,
            filters=filters,
            expand_query=self.config.enable_query_expansion,
        )
        retrieval_time = (time.time() - retrieval_start) * 1000

        # 2. Reranking (optional)
        if self.config.enable_reranking and self.reranker:
            reranking_start = time.time()
            results = self.reranker.rerank(query, results)
            reranking_time = (time.time() - reranking_start) * 1000
        else:
            reranking_time = 0

        # 3. Summarization
        summarization_start = time.time()
        summary = self.summarizer.summarize(
            query=query,
            contexts=results[:5],  # Top 5 for summary
            max_length=self.config.max_summary_length,
        )
        summarization_time = (time.time() - summarization_start) * 1000

        total_time = (time.time() - start_time) * 1000

        # Check SLO compliance
        slo_met = total_time <= self.config.latency_budget_ms

        return PipelineResult(
            query=query,
            results=results,
            summary=summary,
            timing={
                "retrieval_ms": retrieval_time,
                "reranking_ms": reranking_time,
                "summarization_ms": summarization_time,
                "total_ms": total_time,
            },
            slo_met=slo_met,
            profile=self.config.profile,
        )
```
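`PipelineConfig` is referenced throughout but never defined. One plausible shape, covering exactly the fields the factory and `execute` read (including the `profile` field used for SLO reporting), is a small dataclass; field names are taken from the usage above, defaults are assumptions:

```python
from dataclasses import dataclass

@dataclass
class PipelineConfig:
    """Per-profile pipeline settings, matching the fields used by the factory."""
    topK: int = 20
    enable_reranking: bool = True
    enable_query_expansion: bool = True
    max_summary_length: int = 250
    latency_budget_ms: int = 500
    profile: str = "balanced"
```

Usage: `PipelineConfig(topK=10, enable_reranking=False, latency_budget_ms=250, profile="latency_first")`. Keeping it a frozen-shape dataclass makes profile diffs easy to log and compare.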
Evaluation Framework (RAGAS + Custom)

RAGAS Metrics as Release Gates
```python
import numpy as np

class EvaluationGates:
    """
    RAGAS-based evaluation gates for CI/CD.

    Gates:
    - Context Precision: Relevant chunks in top-K
    - Context Recall: Coverage of ground truth
    - Faithfulness: No hallucinations
    - Response Relevancy: Answer matches query
    - Noise Sensitivity: Robust to irrelevant docs
    """

    PROFILE_THRESHOLDS = {
        "balanced": {
            "context_precision": 0.70,
            "context_recall": 0.70,
            "faithfulness": 0.85,
            "response_relevancy": 0.75,
            "noise_sensitivity": 0.80,
            "latency_p95_ms": 500,
        },
        "latency_first": {
            "response_relevancy": 0.70,
            "latency_p95_ms": 250,
            "throughput_qps": 200,
        },
        "quality_first": {
            "context_precision": 0.85,
            "context_recall": 0.85,
            "faithfulness": 0.95,
            "citation_accuracy": 0.90,
            "latency_p95_ms": 5000,
        },
    }

    def evaluate_profile(
        self,
        profile: str,
        pipeline: DocumentSearchPipeline,
        test_fixtures: List[TestCase],
    ) -> EvaluationReport:
        """
        Evaluate pipeline against profile thresholds.

        Returns pass/fail for each metric, detailed scores, and recommendations.
        """
        thresholds = self.PROFILE_THRESHOLDS[profile]

        # Run RAGAS evaluation over the fixture queries
        results = []
        latencies = []
        for test_case in test_fixtures:
            result = pipeline.execute(
                query=test_case.query,
                filters=test_case.filters,
            )
            metrics = self._calculate_ragas_metrics(
                query=test_case.query,
                contexts=result.results,
                answer=result.summary.text,
                ground_truth=test_case.ground_truth,
            )
            results.append(metrics)
            latencies.append(result.timing["total_ms"])

        # Aggregate metrics
        avg_metrics = self._aggregate_metrics(results)
        p95_latency = np.percentile(latencies, 95)

        # Check gates (latency is an upper bound; quality metrics are lower bounds)
        gates_passed = {}
        for metric, threshold in thresholds.items():
            if metric == "latency_p95_ms":
                gates_passed[metric] = p95_latency <= threshold
            else:
                gates_passed[metric] = avg_metrics.get(metric, 0) >= threshold

        return EvaluationReport(
            profile=profile,
            metrics=avg_metrics,
            latency_p95=p95_latency,
            gates_passed=gates_passed,
            all_gates_passed=all(gates_passed.values()),
            recommendations=self._generate_recommendations(avg_metrics, thresholds),
        )

    def _calculate_ragas_metrics(
        self,
        query: str,
        contexts: List[RetrievalResult],
        answer: str,
        ground_truth: str,
    ) -> Dict[str, float]:
        """Calculate RAGAS metrics for a single query."""
        from datasets import Dataset
        from ragas import evaluate
        from ragas.metrics import (
            answer_relevancy,
            context_precision,
            context_recall,
            faithfulness,
        )

        # Prepare data in the RAGAS dataset schema
        data = {
            "question": [query],
            "contexts": [[ctx.content for ctx in contexts]],
            "answer": [answer],
            "ground_truth": [ground_truth],
        }

        # Evaluate (RAGAS expects a datasets.Dataset, not a plain dict)
        result = evaluate(
            dataset=Dataset.from_dict(data),
            metrics=[
                context_precision,
                context_recall,
                faithfulness,
                answer_relevancy,
            ],
        )

        return {
            "context_precision": result["context_precision"],
            "context_recall": result["context_recall"],
            "faithfulness": result["faithfulness"],
            "response_relevancy": result["answer_relevancy"],
        }
```
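Wiring these gates into CI can be as simple as comparing aggregated scores to the profile thresholds. A minimal standalone check (the direction logic mirrors `evaluate_profile` above; the function name is an assumption):

```python
from typing import Dict

def check_gates(metrics: Dict[str, float], thresholds: Dict[str, float]) -> Dict[str, bool]:
    """Return per-metric pass/fail; latency-style metrics are upper bounds,
    all quality metrics are lower bounds. A missing metric fails its gate."""
    UPPER_BOUND_METRICS = {"latency_p95_ms"}
    passed = {}
    for name, threshold in thresholds.items():
        value = metrics.get(name, 0.0)
        if name in UPPER_BOUND_METRICS:
            passed[name] = value <= threshold
        else:
            passed[name] = value >= threshold
    return passed
```

A CI step would then fail the build when `not all(check_gates(scores, PROFILE_THRESHOLDS[profile]).values())`, blocking merges that regress quality or latency.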
User Stories & Business Framing

Story 1: Knowledge Assistant (Support Tickets)
Scenario: Support engineer needs to resolve customer ticket using KB articles
Profile: Balanced
Flow:
- Engineer enters customer question
- System searches KB with hybrid retrieval
- Returns top 5 relevant articles with citations
- Generates grounded summary with inline citations
- Engineer verifies citations and responds to customer
Success Metrics:
- Time to resolution: < 5 minutes (down from 15 minutes)
- Citation accuracy: > 90%
- Customer satisfaction: > 4.5/5
Evaluation:
- Faithfulness > 0.85 (no hallucinations)
- Coverage > 0.7 (answer addresses query)
- Latency < 500ms (interactive experience)
Story 2: Policy/Compliance QA (Audit-Ready)
Scenario: Compliance officer verifies policy adherence
Profile: Quality-First
Flow:
- Officer asks complex compliance question
- System searches only approved policy corpus
- Deep reranking ensures highest relevance
- Multi-document synthesis with strict citations
- Fail closed if coverage < 80%
- Log decision for audit trail
Success Metrics:
- Citation accuracy: > 95%
- Zero unapproved sources used
- Audit log completeness: 100%
Evaluation:
- Faithfulness > 0.95 (strict)
- Context precision > 0.85
- Coverage > 0.8 (or fail closed)
Story 3: Engineering Docs Search (Team Updates)
Scenario: Engineering manager prepares quarterly update using design docs
Profile: Quality-First (with faceted filters)
Flow:
- Manager searches across design docs by team and quarter
- Apply facets: team=backend, quarter=Q3, status=approved
- System retrieves and synthesizes key decisions
- Generates multi-document summary with timeline
- Manager reviews and shares with leadership
Success Metrics:
- Time to prepare: < 30 minutes (down from 4 hours)
- Comprehensive coverage: > 90% of key decisions
- Leadership satisfaction: > 4.7/5
Evaluation:
- Context recall > 0.85 (comprehensive)
- Multi-doc coherence > 0.8
- Citation accuracy > 0.9
Implementation Roadmap
Phase 0: Foundation (Week 1)
Goal: Lock profiles, stub interfaces, run minimal balanced flow
Tasks:
- ✅ Define three profile configurations
- ⬜ Implement `DocumentStore` interface
- ⬜ Implement `HybridDocumentRetriever` (basic)
- ⬜ Implement `GroundedSummarizer` (extractive only)
- ⬜ Implement `DocumentSearchPipeline` with balanced profile
- ⬜ Create 10-query test fixture
- ⬜ Run baseline evaluation
Deliverable: End-to-end balanced flow with baseline metrics
Success Criteria:
- Pipeline executes without errors
- Baseline metrics recorded
- P95 latency measured
Phase 1: Document Loading (Week 2)
Goal: Load PDF, DOCX, XLSX with metadata extraction
Tasks:
- ⬜ Implement `PDFLoader` (pypdf)
- ⬜ Implement `DOCXLoader` (python-docx)
- ⬜ Implement `ExcelLoader` (openpyxl)
- ⬜ Implement `DocumentLoaderFactory`
- ⬜ Add S3 upload/download integration
- ⬜ Write loader tests (each format)
- ⬜ Integration with existing chunker
Deliverable: Universal document loader for 3 formats
Success Criteria:
- All 3 formats load correctly
- Metadata extracted accurately
- Test coverage > 85%
Phase 2: Search Enhancement (Week 3)
Goal: Complete hybrid retrieval with query expansion and facets
Tasks:
- ⬜ Implement query expansion (PRF and HyDE)
- ⬜ Add intent detection and gating
- ⬜ Implement faceted filtering
- ⬜ Add RRF score fusion
- ⬜ Optimize OpenSearch queries
- ⬜ Add result caching (Redis)
- ⬜ Performance testing
Deliverable: Production-grade hybrid retrieval
Success Criteria:
- Context precision > 0.7
- P95 latency < 200ms
- Cache hit rate > 40%
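Result caching (the Redis task above) usually keys on a hash of the normalized query plus canonically serialized filters, so that trivially different requests hit the same entry. A sketch, using an in-process dict to stand in for Redis (the key prefix and helper names are illustrative assumptions):

```python
import hashlib
import json
from typing import Any, Dict, Optional

_cache: Dict[str, Any] = {}  # stands in for Redis in this sketch

def cache_key(query: str, filters: Optional[Dict[str, Any]] = None) -> str:
    """Deterministic key: normalized query + canonically serialized filters."""
    payload = json.dumps(
        {"q": query.strip().lower(), "f": filters or {}},
        sort_keys=True,
    )
    return "search:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()

def get_or_compute(query: str, filters, compute):
    """Return cached results if present; otherwise compute and store them."""
    key = cache_key(query, filters)
    if key not in _cache:
        _cache[key] = compute()
    return _cache[key]
```

With Redis, `get_or_compute` would become a `GET`/`SETEX` pair with a TTL, and cache entries must be invalidated (or versioned) when the underlying index is reindexed.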
Phase 3: Grounded Summarization (Week 4)
Goal: Extractive + abstractive summarization with citations
Tasks:
- ⬜ Complete extractive summarizer (TextRank)
- ⬜ Implement abstractive summarizer (GPT-3.5)
- ⬜ Add citation extraction and tracking
- ⬜ Implement faithfulness verification
- ⬜ Add fail-closed logic for low faithfulness
- ⬜ Multi-document summarization
- ⬜ Summarization tests
Deliverable: Grounded summarization with citations
Success Criteria:
- Faithfulness > 0.85
- Citation accuracy > 0.9
- No hallucinations detected
Phase 4: Profile Implementation (Week 5)
Goal: Implement all three profiles with SLO tracking
Tasks:
- ⬜ Implement latency-first profile
- ⬜ Implement quality-first profile
- ⬜ Add SLO tracking and alerting
- ⬜ Profile-specific optimizations
- ⬜ A/B testing framework
- ⬜ Profile comparison tool
Deliverable: Three production-ready profiles
Success Criteria:
- All profiles meet SLOs
- Evaluation gates pass
- Profile selection logic works
Phase 5: Evaluation Harness (Week 6)
Goal: RAGAS-based CI/CD gates
Tasks:
- ⬜ Implement RAGAS evaluation wrapper
- ⬜ Create tenant test fixtures (50+ queries)
- ⬜ Add custom metrics (citation accuracy, coverage)
- ⬜ Integrate with CI/CD pipeline
- ⬜ Set up regression detection
- ⬜ Create evaluation dashboard
Deliverable: Automated evaluation in CI
Success Criteria:
- RAGAS metrics calculated correctly
- CI gates block failing PRs
- Regression detection works
Phase 6: API & Documentation (Week 7)
Goal: REST API with Docusaurus docs
Tasks:
- ⬜ Implement FastAPI endpoints (7 endpoints)
- ⬜ Add authentication and rate limiting
- ⬜ OpenAPI/Swagger documentation
- ⬜ Docusaurus setup with clear IA
- ⬜ Tutorial for each profile
- ⬜ User story examples
- ⬜ Algolia DocSearch integration
Deliverable: Complete API with docs
Success Criteria:
- All endpoints documented
- Tutorials runnable
- Search works in docs
Phase 7: Production Deployment (Week 8)
Goal: Deploy with monitoring and alerting
Tasks:
- ⬜ OpenSearch provisioned setup (or Serverless)
- ⬜ S3 bucket configuration
- ⬜ MongoDB Atlas setup
- ⬜ Redis cache setup
- ⬜ Prometheus + Grafana dashboards
- ⬜ Alert rules and runbooks
- ⬜ Load testing (1000+ QPS)
- ⬜ Security audit
Deliverable: Production-ready deployment
Success Criteria:
- 99.9% uptime
- All SLOs met
- Alerts working
- Security approved
Monitoring & SLOs

Profile-Based SLOs
| Metric | Balanced | Latency-First | Quality-First |
|---|---|---|---|
| P95 Latency | < 500ms | < 250ms | < 5000ms |
| Context Precision | > 0.70 | - | > 0.85 |
| Context Recall | > 0.70 | - | > 0.85 |
| Faithfulness | > 0.85 | - | > 0.95 |
| Response Relevancy | > 0.75 | > 0.70 | > 0.80 |
| Throughput | 100 QPS | 200 QPS | 20 QPS |
| Availability | 99.9% | 99.9% | 99.5% |
Dashboards
- System Health: Latency, error rate, throughput
- Quality Metrics: RAGAS scores over time
- Profile Performance: Per-profile SLO tracking
- Cost Tracking: LLM costs, storage, compute
- User Stories: Metrics per user story
Test Fixtures

Tenant Fixture Structure
```json
{
  "fixture_name": "support_kb_fixture",
  "description": "Customer support knowledge base queries",
  "profile": "balanced",
  "test_cases": [
    {
      "query": "How do I reset my password?",
      "filters": {"category": "account"},
      "ground_truth": "Navigate to Settings > Security > Reset Password...",
      "expected_sources": ["doc_123", "doc_456"],
      "expected_latency_ms": 400,
      "slo_requirements": {
        "context_precision": 0.75,
        "faithfulness": 0.90
      }
    }
  ]
}
```
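A fixture in that shape can be parsed into typed test cases for the evaluation harness. The plan never defines `TestCase`, so the dataclass below is a sketch matching the fixture fields:

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class TestCase:
    """One evaluation query, mirroring the tenant fixture schema."""
    query: str
    ground_truth: str
    filters: Dict[str, Any] = field(default_factory=dict)
    expected_sources: List[str] = field(default_factory=list)
    expected_latency_ms: int = 0
    slo_requirements: Dict[str, float] = field(default_factory=dict)

def load_fixture(raw: str) -> List[TestCase]:
    """Parse a tenant fixture JSON document into TestCase objects."""
    doc = json.loads(raw)
    return [TestCase(**case) for case in doc["test_cases"]]
```

Because the dataclass fields match the JSON keys, `TestCase(**case)` fails loudly on unexpected keys, which catches fixture-schema drift early.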
Initial Test Fixtures (Week 1)
- Support KB Fixture - 10 queries
- Policy Compliance Fixture - 10 queries
- Engineering Docs Fixture - 10 queries
Total: 30 queries for baseline
Expanded Fixtures (Week 6)
- 50+ queries per fixture
- Edge cases and adversarial queries
- Multi-language support (future)
- Domain-specific fixtures
Documentation Structure (Docusaurus)

Information Architecture
```
docs/
├── getting-started/
│   ├── introduction.md
│   ├── quick-start.md
│   └── installation.md
├── tutorials/
│   ├── balanced-profile-tutorial.md
│   ├── latency-first-tutorial.md
│   ├── quality-first-tutorial.md
│   └── custom-profile-tutorial.md
├── user-stories/
│   ├── knowledge-assistant.md
│   ├── policy-compliance.md
│   └── engineering-docs.md
├── api-reference/
│   ├── overview.md
│   ├── upload-api.md
│   ├── search-api.md
│   └── summarization-api.md
├── guides/
│   ├── profile-selection.md
│   ├── query-optimization.md
│   ├── citation-handling.md
│   └── evaluation-metrics.md
├── deployment/
│   ├── opensearch-provisioned.md
│   ├── opensearch-serverless.md
│   └── production-checklist.md
└── troubleshooting/
    ├── latency-issues.md
    ├── quality-issues.md
    └── common-errors.md
```
Search Integration
Option 1: Algolia DocSearch (Recommended)
- Contextual search in docs
- Instant results
- Analytics
Option 2: Local search plugin
- For air-gapped deployments
- Self-hosted
- Privacy-focused
Refined Cost Analysis

Profile-Based Costs (per 1000 queries)
| Profile | Compute | LLM API | Storage I/O | Total |
|---|---|---|---|---|
| Balanced | $0.50 | $0.00 | $0.10 | $0.60 |
| Latency-First | $0.30 | $0.00 | $0.05 | $0.35 |
| Quality-First | $2.00 | $50.00 | $0.50 | $52.50 |
Notes:
- Balanced: Extractive only (no LLM costs)
- Quality-First: Abstractive with GPT-3.5 ($0.05 per summary)
- Caching reduces costs by 40-60%
Monthly Infrastructure (10,000 users)
| Component | Size | Cost |
|---|---|---|
| OpenSearch (provisioned) | 3x m5.xlarge.search | $350 |
| S3 | 5TB docs | $115 |
| MongoDB Atlas | M30 | $210 |
| Redis | cache.m5.large | $150 |
| Compute (ECS) | 10x t3.large | $730 |
| Total | | $1,555/month |
Cost per user: $0.16/month
Immediate Next Steps

This Week (Week 0)
- ✅ Lock profile definitions
- ⬜ Stub `DocumentStore`, `HybridRetriever`, `GroundedSummarizer` interfaces
- ⬜ Implement minimal balanced profile
- ⬜ Create 10-query test fixture
- ⬜ Run baseline evaluation
- ⬜ Measure baseline latency and metrics
Goal: End-to-end flow by Friday
Next Week (Week 1)
- ⬜ Implement document loaders (PDF, DOCX, XLSX)
- ⬜ S3 integration
- ⬜ Expand test fixtures to 30 queries
- ⬜ Optimize baseline metrics
Goal: Full document loading pipeline
Key Contacts & Review
Technical Review: Backend team, ML team
Business Review: Product manager
Security Review: Security engineer
Final Approval: CTO
Next Review: End of Week 1 (baseline metrics)
This refined plan incorporates the production best practices above while staying focused on delivering value through user stories. Ready to start implementation!