
Document Search & Summarization - Refined Implementation Plan

Created: October 9, 2025
Status: Ready for Implementation
Version: 2.0 (Refined with Production Best Practices)


🎯 Key Decisions

✅ Storage: S3 for documents
✅ Default Summarization: Extractive (with grounded citations)
✅ Initial Formats: PDF, DOCX, XLSX (expand later)
✅ Processing Mode: Adaptive (sync < 10MB, async >= 10MB)
✅ Architecture: Profile-based pipeline (balanced, latency-first, quality-first)


📐 Architecture: Profile-Based Pipeline

Core Concept

Instead of one-size-fits-all, we define three profiles with different SLOs:

Pipeline = Retriever → Reranker → Summarizer
               ↓           ↓            ↓
           Profile-based configuration

Three Profiles

1. Balanced Profile (Default)

Use Case: General-purpose document Q&A

| Component | Configuration | Target |
|---|---|---|
| Retrieval | topK=20, hybrid (BM25 + vector, α=0.5) | < 200ms |
| Reranking | Light cross-encoder, topK=10 | < 100ms |
| Summarization | Extractive multi-chunk with citations | < 200ms |
| Total P95 Latency | | < 500ms |

Metrics Gates:

  • Context Precision > 0.7
  • Context Recall > 0.7
  • Faithfulness > 0.85
  • Response Relevancy > 0.75

2. Latency-First Profile

Use Case: Interactive search, auto-complete, real-time assistants

| Component | Configuration | Target |
|---|---|---|
| Retrieval | topK=10, vector-only or BM25-only | < 100ms |
| Reranking | Shallow or skip | < 50ms |
| Summarization | Single-chunk extractive, tight window | < 100ms |
| Total P95 Latency | | < 250ms |

Metrics Gates:

  • Response Relevancy > 0.7
  • Latency P95 < 250ms
  • Throughput > 200 QPS

3. Quality-First Profile

Use Case: Complex research, legal compliance, multi-document synthesis

| Component | Configuration | Target |
|---|---|---|
| Retrieval | topK=50, hybrid with query expansion | < 500ms |
| Reranking | Deep cross-encoder + LLM reranking | < 1000ms |
| Summarization | Multi-doc abstractive with strict citations | < 3000ms |
| Total P95 Latency | | < 5000ms |

Metrics Gates:

  • Context Precision > 0.85
  • Context Recall > 0.85
  • Faithfulness > 0.95
  • Citation Accuracy > 0.9

๐Ÿ—๏ธ Component Architectureโ€‹

1. Store Interface (Unified)

from abc import ABC, abstractmethod
from typing import Any, Dict, List


class DocumentStore(ABC):
    """Unified store interface for consistency across backends."""

    @abstractmethod
    def hybrid_search(
        self,
        query: str,
        query_embedding: List[float],
        filters: Dict[str, Any],
        topK: int,
        alpha: float,  # BM25 vs vector weight
    ) -> List[SearchResult]:
        """Hybrid BM25 + ANN search with score fusion."""
        pass

    @abstractmethod
    def get_facets(
        self,
        query: str,
        filters: Dict[str, Any],
    ) -> Dict[str, List[FacetValue]]:
        """Get faceted navigation options."""
        pass

Backends:

  • OpenSearch (provisioned) - Full control, advanced aggregations
  • OpenSearch Serverless - Low-ops, auto-scaling
  • MongoDB Atlas Vector Search - Alternative backend
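The interface above returns SearchResult and FacetValue objects that are not defined elsewhere in this plan. A minimal sketch of what they could look like (the field names are illustrative assumptions, not a fixed contract):

from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class SearchResult:
    """One hit returned by DocumentStore.hybrid_search (illustrative fields)."""
    chunk_id: str
    content: str
    score: float  # fused BM25 + vector score
    metadata: Dict[str, Any] = field(default_factory=dict)  # e.g. document_id, title


@dataclass
class FacetValue:
    """One bucket returned by DocumentStore.get_facets (illustrative fields)."""
    value: str   # e.g. "backend" for a `team` facet
    count: int   # number of matching documents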

2. Retriever (Hybrid)

from typing import Any, Dict, List, Optional


class HybridDocumentRetriever:
    """
    Hybrid BM25 + ANN retrieval with configurable fusion.

    Features:
    - Reciprocal Rank Fusion (RRF)
    - Query expansion (PRF, HyDE)
    - Intent-based gating
    - Faceted filtering
    """

    def __init__(
        self,
        store: DocumentStore,
        alpha: float = 0.5,  # BM25 vs vector weight
        query_expansion: Optional[QueryExpander] = None,
        filter_builder: Optional[FilterBuilder] = None,
    ):
        self.store = store
        self.alpha = alpha
        self.query_expansion = query_expansion
        self.filter_builder = filter_builder

    def retrieve(
        self,
        query: str,
        topK: int = 20,
        filters: Optional[Dict[str, Any]] = None,
        expand_query: bool = True,
    ) -> List[RetrievalResult]:
        """Execute hybrid retrieval with optional query expansion."""

        # Intent detection & gating
        intent = self._detect_intent(query)

        # Query expansion (PRF or HyDE), only when an expander is configured
        if expand_query and self.query_expansion and intent in ["complex", "research"]:
            expanded_queries = self.query_expansion.expand(query, method="hyde")
        else:
            expanded_queries = [query]

        # Build filters
        if self.filter_builder:
            filters = self.filter_builder.build(query, filters)

        # Execute hybrid search for each expanded query
        all_results = []
        for exp_query in expanded_queries:
            embedding = self._get_embedding(exp_query)
            results = self.store.hybrid_search(
                query=exp_query,
                query_embedding=embedding,
                filters=filters,
                topK=topK,
                alpha=self.alpha,
            )
            all_results.extend(results)

        # Reciprocal Rank Fusion
        fused_results = self._reciprocal_rank_fusion(all_results, k=60)

        return fused_results[:topK]

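The retriever calls _reciprocal_rank_fusion without showing it. Below is a minimal standalone sketch of standard RRF (each result contributes 1 / (k + rank) for every list it appears in); in the class above it would be fed the per-query ranked lists, or all_results regrouped per query, rather than a single flattened list:

from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(
    ranked_lists: List[List["SearchResult"]],
    k: int = 60,
) -> List["SearchResult"]:
    """Fuse several ranked result lists with the standard RRF formula.

    Results are deduplicated by chunk_id and returned best-first.
    """
    scores: Dict[str, float] = defaultdict(float)
    best: Dict[str, "SearchResult"] = {}

    for results in ranked_lists:
        for rank, result in enumerate(results, start=1):
            scores[result.chunk_id] += 1.0 / (k + rank)
            best.setdefault(result.chunk_id, result)

    fused_ids = sorted(scores, key=scores.get, reverse=True)
    return [best[cid] for cid in fused_ids]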

3. Grounded Summarizer (Citation-Aware)

from typing import Any, List, Optional


class GroundedSummarizer:
    """
    Grounded summarization with explicit citations.

    Key Features:
    - Only uses retrieved context (no hallucinations)
    - Extractive and abstractive modes
    - Citation tracking at sentence level
    - Coverage and faithfulness metrics
    """

    def __init__(
        self,
        mode: str = "extractive",        # extractive, abstractive
        citation_style: str = "inline",  # inline, footnote
        llm: Optional[Any] = None,       # LLM client, required for abstractive mode
    ):
        self.mode = mode
        self.citation_style = citation_style
        self.llm = llm

    def summarize(
        self,
        query: str,
        contexts: List[RetrievalResult],
        max_length: int = 250,
    ) -> GroundedSummary:
        """Generate grounded summary with citations."""

        if self.mode == "extractive":
            return self._extractive_summary(query, contexts, max_length)
        else:
            return self._abstractive_summary(query, contexts, max_length)

    def _extractive_summary(
        self,
        query: str,
        contexts: List[RetrievalResult],
        max_length: int,
    ) -> GroundedSummary:
        """Extract key sentences with citations."""

        # Score sentences by relevance to query
        sentences = []
        for ctx in contexts:
            for sent in self._split_sentences(ctx.content):
                score = self._score_sentence(sent, query)
                sentences.append({
                    "text": sent,
                    "score": score,
                    "source": ctx.metadata["document_id"],
                    "chunk_id": ctx.chunk_id,
                })

        # Sort and select top sentences
        sentences.sort(key=lambda x: x["score"], reverse=True)

        # Build summary with citations, numbering sources in first-use order
        summary_parts = []
        total_length = 0
        source_numbers = {}  # document_id -> citation number

        for sent_info in sentences:
            sent_length = len(sent_info["text"].split())
            if total_length + sent_length > max_length:
                break

            # Add citation (reuse the number if this source was already cited)
            src = sent_info["source"]
            if src not in source_numbers:
                source_numbers[src] = len(source_numbers) + 1
            citation = f'[{source_numbers[src]}]'
            summary_parts.append(f'{sent_info["text"]} {citation}')
            total_length += sent_length

        summary_text = " ".join(summary_parts)

        # Build citation map
        citations = {
            num: {
                "document_id": src,
                "document_title": self._get_document_title(src),
            }
            for src, num in source_numbers.items()
        }

        return GroundedSummary(
            text=summary_text,
            citations=citations,
            coverage=self._calculate_coverage(contexts, summary_parts),
            faithfulness=1.0,  # Extractive is always faithful
            mode="extractive",
        )

    def _abstractive_summary(
        self,
        query: str,
        contexts: List[RetrievalResult],
        max_length: int,
    ) -> GroundedSummary:
        """Generate abstractive summary with LLM + strict grounding."""

        # Build context with source IDs
        context_text = ""
        for idx, ctx in enumerate(contexts, 1):
            context_text += f"\n[Source {idx}]: {ctx.content}\n"

        # Prompt engineering for grounded generation
        prompt = f"""Based ONLY on the provided sources, answer this question: {query}

Sources:
{context_text}

Instructions:
1. Only use information from the sources above
2. Cite sources using [1], [2], etc. after each claim
3. If information is not in sources, say "Information not available"
4. Keep summary under {max_length} words
5. Be factual and specific

Answer:"""

        # Generate with LLM
        response = self.llm.invoke(prompt)
        summary_text = response.content

        # Extract citations from generated text
        citations = self._extract_citations(summary_text, contexts)

        # Verify faithfulness
        faithfulness = self._verify_faithfulness(summary_text, contexts)

        # Fail closed if faithfulness is low
        if faithfulness < 0.8:
            return self._extractive_summary(query, contexts, max_length)

        return GroundedSummary(
            text=summary_text,
            citations=citations,
            coverage=self._calculate_coverage(contexts, [summary_text]),
            faithfulness=faithfulness,
            mode="abstractive",
        )
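_verify_faithfulness is left abstract above. One simple, illustrative way to approximate it is sentence-level lexical overlap against the retrieved contexts; a production system would more likely use an NLI model or an LLM judge, so treat this purely as a sketch:

import re
from typing import List


def verify_faithfulness(summary: str, contexts: List[str], min_overlap: float = 0.6) -> float:
    """Fraction of summary sentences whose content words are covered by some context."""

    def words(text: str) -> set:
        return {w.lower() for w in re.findall(r"[a-zA-Z0-9]+", text) if len(w) > 3}

    context_word_sets = [words(c) for c in contexts]
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", summary) if s.strip()]
    if not sentences:
        return 0.0

    supported = 0
    for sent in sentences:
        sw = words(sent)
        if not sw:
            supported += 1  # nothing substantive to verify (e.g. a bare citation marker)
            continue
        best = max(len(sw & cw) / len(sw) for cw in context_word_sets) if context_word_sets else 0.0
        if best >= min_overlap:
            supported += 1

    return supported / len(sentences)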

4. Pipeline Abstraction

import time
from typing import Any, Dict, Optional


class DocumentSearchPipeline:
    """
    Configurable pipeline: Retriever → Reranker → Summarizer

    Profiles:
    - balanced: General-purpose, moderate quality & latency
    - latency_first: Interactive, fast response
    - quality_first: Research-grade, comprehensive
    """

    @classmethod
    def create_profile(cls, profile: str, store: DocumentStore) -> "DocumentSearchPipeline":
        """Factory method for profile-based pipelines."""

        if profile == "balanced":
            return cls(
                retriever=HybridDocumentRetriever(
                    store=store,
                    alpha=0.5,
                    query_expansion=QueryExpander(method="prf"),
                ),
                reranker=CrossEncoderReranker(
                    model="ms-marco-MiniLM-L-6-v2",
                    topK=10,
                ),
                summarizer=GroundedSummarizer(
                    mode="extractive",
                    citation_style="inline",
                ),
                config=PipelineConfig(
                    profile="balanced",
                    topK=20,
                    enable_reranking=True,
                    enable_query_expansion=True,
                    max_summary_length=250,
                    latency_budget_ms=500,
                ),
            )

        elif profile == "latency_first":
            return cls(
                retriever=HybridDocumentRetriever(
                    store=store,
                    alpha=0.7,             # Favor BM25 (faster)
                    query_expansion=None,  # Skip expansion
                ),
                reranker=None,             # Skip reranking
                summarizer=GroundedSummarizer(
                    mode="extractive",
                    citation_style="inline",
                ),
                config=PipelineConfig(
                    profile="latency_first",
                    topK=10,
                    enable_reranking=False,
                    enable_query_expansion=False,
                    max_summary_length=150,
                    latency_budget_ms=250,
                ),
            )

        elif profile == "quality_first":
            return cls(
                retriever=HybridDocumentRetriever(
                    store=store,
                    alpha=0.5,
                    query_expansion=QueryExpander(method="hyde"),
                ),
                reranker=LLMReranker(      # Deep reranking
                    model="gpt-3.5-turbo",
                    topK=20,
                ),
                summarizer=GroundedSummarizer(
                    mode="abstractive",    # requires an injected LLM client
                    citation_style="inline",
                ),
                config=PipelineConfig(
                    profile="quality_first",
                    topK=50,
                    enable_reranking=True,
                    enable_query_expansion=True,
                    max_summary_length=500,
                    latency_budget_ms=5000,
                ),
            )

        raise ValueError(f"Unknown profile: {profile}")

    def execute(
        self,
        query: str,
        filters: Optional[Dict[str, Any]] = None,
    ) -> PipelineResult:
        """Execute full pipeline with SLO tracking."""

        start_time = time.time()

        # 1. Retrieval
        retrieval_start = time.time()
        results = self.retriever.retrieve(
            query=query,
            topK=self.config.topK,
            filters=filters,
            expand_query=self.config.enable_query_expansion,
        )
        retrieval_time = (time.time() - retrieval_start) * 1000

        # 2. Reranking (optional)
        if self.config.enable_reranking and self.reranker:
            reranking_start = time.time()
            results = self.reranker.rerank(query, results)
            reranking_time = (time.time() - reranking_start) * 1000
        else:
            reranking_time = 0

        # 3. Summarization
        summarization_start = time.time()
        summary = self.summarizer.summarize(
            query=query,
            contexts=results[:5],  # Top 5 for summary
            max_length=self.config.max_summary_length,
        )
        summarization_time = (time.time() - summarization_start) * 1000

        total_time = (time.time() - start_time) * 1000

        # Check SLO compliance
        slo_met = total_time <= self.config.latency_budget_ms

        return PipelineResult(
            query=query,
            results=results,
            summary=summary,
            timing={
                "retrieval_ms": retrieval_time,
                "reranking_ms": reranking_time,
                "summarization_ms": summarization_time,
                "total_ms": total_time,
            },
            slo_met=slo_met,
            profile=self.config.profile,
        )
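For orientation, a minimal wiring sketch. PipelineConfig is shown as a plain dataclass matching the fields used above, and OpenSearchDocumentStore is a hypothetical DocumentStore implementation, not existing code:

from dataclasses import dataclass


@dataclass
class PipelineConfig:
    """Per-profile knobs used by create_profile and execute (illustrative)."""
    profile: str
    topK: int
    enable_reranking: bool
    enable_query_expansion: bool
    max_summary_length: int
    latency_budget_ms: int


# Hypothetical wiring; endpoint and index names are placeholders
store = OpenSearchDocumentStore(endpoint="https://search.internal", index="documents")
pipeline = DocumentSearchPipeline.create_profile("balanced", store=store)

result = pipeline.execute(
    query="What is our data retention policy for customer PII?",
    filters={"category": "policy"},
)
print(result.summary.text)        # grounded summary with [n] citations
print(result.timing["total_ms"], "ms; SLO met:", result.slo_met)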

📊 Evaluation Framework (RAGAS + Custom)

RAGAS Metrics as Release Gates

import numpy as np
from typing import Dict, List


class EvaluationGates:
    """
    RAGAS-based evaluation gates for CI/CD.

    Gates:
    - Context Precision: Relevant chunks in top-K
    - Context Recall: Coverage of ground truth
    - Faithfulness: No hallucinations
    - Response Relevancy: Answer matches query
    - Noise Sensitivity: Robust to irrelevant docs
    """

    PROFILE_THRESHOLDS = {
        "balanced": {
            "context_precision": 0.70,
            "context_recall": 0.70,
            "faithfulness": 0.85,
            "response_relevancy": 0.75,
            "noise_sensitivity": 0.80,
            "latency_p95_ms": 500,
        },
        "latency_first": {
            "response_relevancy": 0.70,
            "latency_p95_ms": 250,
            "throughput_qps": 200,
        },
        "quality_first": {
            "context_precision": 0.85,
            "context_recall": 0.85,
            "faithfulness": 0.95,
            "citation_accuracy": 0.90,
            "latency_p95_ms": 5000,
        },
    }

    def evaluate_profile(
        self,
        profile: str,
        pipeline: DocumentSearchPipeline,
        test_fixtures: List[TestCase],
    ) -> EvaluationReport:
        """
        Evaluate pipeline against profile thresholds.

        Returns:
        - pass/fail for each metric
        - detailed scores
        - recommendations
        """

        thresholds = self.PROFILE_THRESHOLDS[profile]

        # Run RAGAS evaluation
        results = []
        latencies = []

        for test_case in test_fixtures:
            result = pipeline.execute(
                query=test_case.query,
                filters=test_case.filters,
            )

            # Calculate RAGAS metrics
            metrics = self._calculate_ragas_metrics(
                query=test_case.query,
                contexts=result.results,
                answer=result.summary.text,
                ground_truth=test_case.ground_truth,
            )

            results.append(metrics)
            latencies.append(result.timing["total_ms"])

        # Aggregate metrics
        avg_metrics = self._aggregate_metrics(results)
        p95_latency = np.percentile(latencies, 95)

        # Check gates
        gates_passed = {}
        for metric, threshold in thresholds.items():
            if metric == "latency_p95_ms":
                gates_passed[metric] = p95_latency <= threshold
            else:
                gates_passed[metric] = avg_metrics.get(metric, 0) >= threshold

        all_passed = all(gates_passed.values())

        return EvaluationReport(
            profile=profile,
            metrics=avg_metrics,
            latency_p95=p95_latency,
            gates_passed=gates_passed,
            all_gates_passed=all_passed,
            recommendations=self._generate_recommendations(avg_metrics, thresholds),
        )

    def _calculate_ragas_metrics(
        self,
        query: str,
        contexts: List[RetrievalResult],
        answer: str,
        ground_truth: str,
    ) -> Dict[str, float]:
        """Calculate RAGAS metrics for a single query."""

        from datasets import Dataset
        from ragas import evaluate
        from ragas.metrics import (
            context_precision,
            context_recall,
            faithfulness,
            answer_relevancy,
        )

        # Prepare data for RAGAS (evaluate expects a datasets.Dataset)
        data = Dataset.from_dict({
            "question": [query],
            "contexts": [[ctx.content for ctx in contexts]],
            "answer": [answer],
            "ground_truth": [ground_truth],
        })

        # Evaluate
        result = evaluate(
            dataset=data,
            metrics=[
                context_precision,
                context_recall,
                faithfulness,
                answer_relevancy,
            ],
        )

        return {
            "context_precision": result["context_precision"],
            "context_recall": result["context_recall"],
            "faithfulness": result["faithfulness"],
            "response_relevancy": result["answer_relevancy"],
        }
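A sketch of how these gates could run as a CI job with pytest. The store fixture and the load_test_fixtures helper are assumptions of this sketch, not existing code:

import pytest

PROFILES = ["balanced", "latency_first", "quality_first"]


@pytest.mark.parametrize("profile", PROFILES)
def test_profile_meets_release_gates(profile, store):
    """Fail the build if any RAGAS or latency gate for the profile is not met."""
    pipeline = DocumentSearchPipeline.create_profile(profile, store=store)
    fixtures = load_test_fixtures(profile)  # hypothetical helper; see Test Fixtures section
    report = EvaluationGates().evaluate_profile(profile, pipeline, fixtures)

    failed = [metric for metric, ok in report.gates_passed.items() if not ok]
    assert report.all_gates_passed, f"{profile} gates failed: {failed} (scores: {report.metrics})"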

🎭 User Stories & Business Framing

Story 1: Knowledge Assistant (Support Tickets)

Scenario: Support engineer needs to resolve customer ticket using KB articles

Profile: Balanced

Flow:

  1. Engineer enters customer question
  2. System searches KB with hybrid retrieval
  3. Returns top 5 relevant articles with citations
  4. Generates grounded summary with inline citations
  5. Engineer verifies citations and responds to customer

Success Metrics:

  • Time to resolution: < 5 minutes (down from 15 minutes)
  • Citation accuracy: > 90%
  • Customer satisfaction: > 4.5/5

Evaluation:

  • Faithfulness > 0.85 (no hallucinations)
  • Coverage > 0.7 (answer addresses query)
  • Latency < 500ms (interactive experience)

Story 2: Policy/Compliance QA (Audit-Ready)

Scenario: Compliance officer verifies policy adherence

Profile: Quality-First

Flow:

  1. Officer asks complex compliance question
  2. System searches only approved policy corpus
  3. Deep reranking ensures highest relevance
  4. Multi-document synthesis with strict citations
  5. Fail closed if coverage < 80%
  6. Log decision for audit trail

Success Metrics:

  • Citation accuracy: > 95%
  • Zero unapproved sources used
  • Audit log completeness: 100%

Evaluation:

  • Faithfulness > 0.95 (strict)
  • Context precision > 0.85
  • Coverage > 0.8 (or fail closed)

Story 3: Engineering Docs Search (Team Updates)

Scenario: Engineering manager prepares quarterly update using design docs

Profile: Quality-First (with faceted filters)

Flow:

  1. Manager searches across design docs by team and quarter
  2. Apply facets: team=backend, quarter=Q3, status=approved
  3. System retrieves and synthesizes key decisions
  4. Generates multi-document summary with timeline
  5. Manager reviews and shares with leadership

Success Metrics:

  • Time to prepare: < 30 minutes (down from 4 hours)
  • Comprehensive coverage: > 90% of key decisions
  • Leadership satisfaction: > 4.7/5

Evaluation:

  • Context recall > 0.85 (comprehensive)
  • Multi-doc coherence > 0.8
  • Citation accuracy > 0.9

🚀 Implementation Roadmap

Phase 0: Foundation (Week 1)

Goal: Lock profiles, stub interfaces, run minimal balanced flow

Tasks:

  1. ✅ Define three profile configurations
  2. ⬜ Implement DocumentStore interface
  3. ⬜ Implement HybridDocumentRetriever (basic)
  4. ⬜ Implement GroundedSummarizer (extractive only)
  5. ⬜ Implement DocumentSearchPipeline with balanced profile
  6. ⬜ Create 10-query test fixture
  7. ⬜ Run baseline evaluation

Deliverable: End-to-end balanced flow with baseline metrics

Success Criteria:

  • Pipeline executes without errors
  • Baseline metrics recorded
  • P95 latency measured

Phase 1: Document Loading (Week 2)

Goal: Load PDF, DOCX, XLSX with metadata extraction

Tasks:

  1. ⬜ Implement PDFLoader (pypdf; see the sketch after this list)
  2. ⬜ Implement DOCXLoader (python-docx)
  3. ⬜ Implement ExcelLoader (openpyxl)
  4. ⬜ Implement DocumentLoaderFactory
  5. ⬜ Add S3 upload/download integration
  6. ⬜ Write loader tests (each format)
  7. ⬜ Integration with existing chunker
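A minimal sketch of the PDF loader from task 1, built on pypdf's PdfReader. It returns a plain dict here rather than the plan's eventual document type, which is still to be defined:

from pypdf import PdfReader


class PDFLoader:
    """Sketch of a PDF loader: per-page text extraction plus basic metadata."""

    def load(self, path: str) -> dict:
        reader = PdfReader(path)
        pages = [page.extract_text() or "" for page in reader.pages]
        meta = reader.metadata  # may be None for PDFs without an info dictionary
        return {
            "content": "\n\n".join(pages),
            "metadata": {
                "title": meta.title if meta else None,
                "author": meta.author if meta else None,
                "page_count": len(reader.pages),
                "format": "pdf",
            },
        }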

Deliverable: Universal document loader for 3 formats

Success Criteria:

  • All 3 formats load correctly
  • Metadata extracted accurately
  • Test coverage > 85%

Phase 2: Search Enhancement (Week 3)

Goal: Complete hybrid retrieval with query expansion and facets

Tasks:

  1. ⬜ Implement query expansion (PRF and HyDE; see the sketch after this list)
  2. ⬜ Add intent detection and gating
  3. ⬜ Implement faceted filtering
  4. ⬜ Add RRF score fusion
  5. ⬜ Optimize OpenSearch queries
  6. ⬜ Add result caching (Redis)
  7. ⬜ Performance testing
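A sketch of HyDE-style expansion from task 1: ask an LLM for a short hypothetical answer and retrieve with that text alongside the original query. The llm callable is an assumption; only the HyDE path is shown, PRF would slot in beside it:

from typing import Callable, List, Optional


class QueryExpander:
    """Sketch of query expansion; only the HyDE path is implemented here."""

    def __init__(self, method: str = "hyde", llm: Optional[Callable[[str], str]] = None):
        self.method = method
        self.llm = llm  # any text-in/text-out callable; assumed, not specified by the plan

    def expand(self, query: str, method: Optional[str] = None) -> List[str]:
        method = method or self.method
        if method != "hyde" or self.llm is None:
            return [query]  # PRF or other strategies would plug in here

        # HyDE: embed a hypothetical answer rather than the raw query
        prompt = (
            "Write a short, factual paragraph that directly answers this question:\n"
            f"{query}"
        )
        hypothetical_doc = self.llm(prompt)
        # Both strings are searched and then fused downstream (e.g. via RRF)
        return [query, hypothetical_doc]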

Deliverable: Production-grade hybrid retrieval

Success Criteria:

  • Context precision > 0.7
  • P95 latency < 200ms
  • Cache hit rate > 40%

Phase 3: Grounded Summarization (Week 4)

Goal: Extractive + abstractive summarization with citations

Tasks:

  1. ⬜ Complete extractive summarizer (TextRank)
  2. ⬜ Implement abstractive summarizer (GPT-3.5)
  3. ⬜ Add citation extraction and tracking (see the sketch after this list)
  4. ⬜ Implement faithfulness verification
  5. ⬜ Add fail-closed logic for low faithfulness
  6. ⬜ Multi-document summarization
  7. ⬜ Summarization tests
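A small sketch of the citation extraction in task 3: pull the [n] markers the LLM emitted and map them back to the retrieved contexts, following the [Source n] numbering used in the grounded prompt above:

import re
from typing import Dict, List


def extract_citations(summary_text: str, contexts: List["RetrievalResult"]) -> Dict[int, dict]:
    """Map [n] markers in the generated summary back to the n-th retrieved context."""
    cited = sorted({int(m) for m in re.findall(r"\[(\d+)\]", summary_text)})
    citations = {}
    for n in cited:
        if 1 <= n <= len(contexts):
            ctx = contexts[n - 1]
            citations[n] = {
                "document_id": ctx.metadata.get("document_id"),
                "chunk_id": ctx.chunk_id,
            }
    return citations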

Deliverable: Grounded summarization with citations

Success Criteria:

  • Faithfulness > 0.85
  • Citation accuracy > 0.9
  • No hallucinations detected

Phase 4: Profile Implementation (Week 5)

Goal: Implement all three profiles with SLO tracking

Tasks:

  1. ⬜ Implement latency-first profile
  2. ⬜ Implement quality-first profile
  3. ⬜ Add SLO tracking and alerting
  4. ⬜ Profile-specific optimizations
  5. ⬜ A/B testing framework
  6. ⬜ Profile comparison tool

Deliverable: Three production-ready profiles

Success Criteria:

  • All profiles meet SLOs
  • Evaluation gates pass
  • Profile selection logic works

Phase 5: Evaluation Harness (Week 6)

Goal: RAGAS-based CI/CD gates

Tasks:

  1. ⬜ Implement RAGAS evaluation wrapper
  2. ⬜ Create tenant test fixtures (50+ queries)
  3. ⬜ Add custom metrics (citation accuracy, coverage)
  4. ⬜ Integrate with CI/CD pipeline
  5. ⬜ Set up regression detection
  6. ⬜ Create evaluation dashboard

Deliverable: Automated evaluation in CI

Success Criteria:

  • RAGAS metrics calculated correctly
  • CI gates block failing PRs
  • Regression detection works

Phase 6: API & Documentation (Week 7)

Goal: REST API with Docusaurus docs

Tasks:

  1. ⬜ Implement FastAPI endpoints (7 endpoints)
  2. ⬜ Add authentication and rate limiting
  3. ⬜ OpenAPI/Swagger documentation
  4. ⬜ Docusaurus setup with clear IA
  5. ⬜ Tutorial for each profile
  6. ⬜ User story examples
  7. ⬜ Algolia DocSearch integration

Deliverable: Complete API with docs

Success Criteria:

  • All endpoints documented
  • Tutorials runnable
  • Search works in docs

Phase 7: Production Deployment (Week 8)

Goal: Deploy with monitoring and alerting

Tasks:

  1. ⬜ OpenSearch provisioned setup (or Serverless)
  2. ⬜ S3 bucket configuration
  3. ⬜ MongoDB Atlas setup
  4. ⬜ Redis cache setup
  5. ⬜ Prometheus + Grafana dashboards
  6. ⬜ Alert rules and runbooks
  7. ⬜ Load testing (1000+ QPS)
  8. ⬜ Security audit

Deliverable: Production-ready deployment

Success Criteria:

  • 99.9% uptime
  • All SLOs met
  • Alerts working
  • Security approved

📊 Monitoring & SLOs

Profile-Based SLOs

| Metric | Balanced | Latency-First | Quality-First |
|---|---|---|---|
| P95 Latency | < 500ms | < 250ms | < 5000ms |
| Context Precision | > 0.70 | - | > 0.85 |
| Context Recall | > 0.70 | - | > 0.85 |
| Faithfulness | > 0.85 | - | > 0.95 |
| Response Relevancy | > 0.75 | > 0.70 | > 0.80 |
| Throughput | 100 QPS | 200 QPS | 20 QPS |
| Availability | 99.9% | 99.9% | 99.5% |

Dashboards

  1. System Health: Latency, error rate, throughput
  2. Quality Metrics: RAGAS scores over time
  3. Profile Performance: Per-profile SLO tracking
  4. Cost Tracking: LLM costs, storage, compute
  5. User Stories: Metrics per user story
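A sketch of the per-profile SLO instrumentation these dashboards could be built on, using the Prometheus Python client; the metric names are illustrative assumptions:

from prometheus_client import Counter, Histogram

# Label by profile so each SLO can be tracked separately on the dashboards
PIPELINE_LATENCY = Histogram(
    "docsearch_pipeline_latency_seconds",
    "End-to-end and per-stage pipeline latency",
    labelnames=["profile", "stage"],
)
SLO_VIOLATIONS = Counter(
    "docsearch_slo_violations_total",
    "Requests that exceeded the profile latency budget",
    labelnames=["profile"],
)


def record_pipeline_result(result: "PipelineResult") -> None:
    """Push one PipelineResult's timings into Prometheus."""
    for stage in ("retrieval", "reranking", "summarization", "total"):
        PIPELINE_LATENCY.labels(result.profile, stage).observe(
            result.timing[f"{stage}_ms"] / 1000.0
        )
    if not result.slo_met:
        SLO_VIOLATIONS.labels(result.profile).inc()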

🧪 Test Fixtures

Tenant Fixture Structure

{
  "fixture_name": "support_kb_fixture",
  "description": "Customer support knowledge base queries",
  "profile": "balanced",
  "test_cases": [
    {
      "query": "How do I reset my password?",
      "filters": {"category": "account"},
      "ground_truth": "Navigate to Settings > Security > Reset Password...",
      "expected_sources": ["doc_123", "doc_456"],
      "expected_latency_ms": 400,
      "slo_requirements": {
        "context_precision": 0.75,
        "faithfulness": 0.90
      }
    }
  ]
}
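A minimal sketch of loading such a fixture file into the TestCase objects consumed by EvaluationGates. The dataclass shape mirrors the JSON above and is an assumption of this plan, not an existing type:

import json
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class TestCase:
    """One evaluation query from a tenant fixture (illustrative shape)."""
    query: str
    ground_truth: str
    filters: Optional[Dict[str, Any]] = None
    expected_sources: List[str] = field(default_factory=list)
    expected_latency_ms: Optional[int] = None
    slo_requirements: Dict[str, float] = field(default_factory=dict)


def load_fixture(path: str) -> List[TestCase]:
    """Read a fixture JSON file and return its test cases."""
    with open(path) as f:
        fixture = json.load(f)
    return [TestCase(**case) for case in fixture["test_cases"]]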

Initial Test Fixtures (Week 1)

  1. Support KB Fixture - 10 queries
  2. Policy Compliance Fixture - 10 queries
  3. Engineering Docs Fixture - 10 queries

Total: 30 queries for baseline

Expanded Fixtures (Week 6)

  • 50+ queries per fixture
  • Edge cases and adversarial queries
  • Multi-language support (future)
  • Domain-specific fixtures

📖 Documentation Structure (Docusaurus)

Information Architecture

docs/
├── getting-started/
│   ├── introduction.md
│   ├── quick-start.md
│   └── installation.md
├── tutorials/
│   ├── balanced-profile-tutorial.md
│   ├── latency-first-tutorial.md
│   ├── quality-first-tutorial.md
│   └── custom-profile-tutorial.md
├── user-stories/
│   ├── knowledge-assistant.md
│   ├── policy-compliance.md
│   └── engineering-docs.md
├── api-reference/
│   ├── overview.md
│   ├── upload-api.md
│   ├── search-api.md
│   └── summarization-api.md
├── guides/
│   ├── profile-selection.md
│   ├── query-optimization.md
│   ├── citation-handling.md
│   └── evaluation-metrics.md
├── deployment/
│   ├── opensearch-provisioned.md
│   ├── opensearch-serverless.md
│   └── production-checklist.md
└── troubleshooting/
    ├── latency-issues.md
    ├── quality-issues.md
    └── common-errors.md

Search Integration

Option 1: Algolia DocSearch (Recommended)

  • Contextual search in docs
  • Instant results
  • Analytics

Option 2: Local search plugin

  • For air-gapped deployments
  • Self-hosted
  • Privacy-focused

💰 Refined Cost Analysis

Profile-Based Costs (per 1000 queries)

| Profile | Compute | LLM API | Storage I/O | Total |
|---|---|---|---|---|
| Balanced | $0.50 | $0.00 | $0.10 | $0.60 |
| Latency-First | $0.30 | $0.00 | $0.05 | $0.35 |
| Quality-First | $2.00 | $50.00 | $0.50 | $52.50 |

Notes:

  • Balanced: Extractive only (no LLM costs)
  • Quality-First: Abstractive with GPT-3.5 ($0.05 per summary)
  • Caching reduces costs by 40-60%
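Because caching drives much of that saving, here is a sketch of the Redis result cache mentioned in Phase 2, keyed on a hash of profile, query, and filters. The key format and TTL are assumptions to be tuned per profile:

import hashlib
import json
from typing import Any, Dict, Optional

import redis

cache = redis.Redis(host="localhost", port=6379, db=0)
CACHE_TTL_SECONDS = 15 * 60  # assumed TTL


def cache_key(profile: str, query: str, filters: Optional[Dict[str, Any]]) -> str:
    payload = json.dumps({"p": profile, "q": query, "f": filters or {}}, sort_keys=True)
    return "docsearch:" + hashlib.sha256(payload.encode()).hexdigest()


def cached_execute(pipeline: "DocumentSearchPipeline", query: str,
                   filters: Optional[Dict[str, Any]] = None) -> dict:
    """Return a cached response when available, otherwise run the pipeline and cache it."""
    key = cache_key(pipeline.config.profile, query, filters)
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)

    result = pipeline.execute(query, filters=filters)
    response = {"summary": result.summary.text, "timing": result.timing}  # serializable subset
    cache.setex(key, CACHE_TTL_SECONDS, json.dumps(response))
    return response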

Monthly Infrastructure (10,000 users)

| Component | Size | Cost |
|---|---|---|
| OpenSearch (provisioned) | 3x m5.xlarge.search | $350 |
| S3 | 5TB docs | $115 |
| MongoDB Atlas | M30 | $210 |
| Redis | cache.m5.large | $150 |
| Compute (ECS) | 10x t3.large | $730 |
| Total | | $1,555/month |

Cost per user: $0.16/month


🎯 Immediate Next Steps

This Week (Week 0)

  1. ✅ Lock profile definitions
  2. ⬜ Stub DocumentStore, HybridRetriever, GroundedSummarizer interfaces
  3. ⬜ Implement minimal balanced profile
  4. ⬜ Create 10-query test fixture
  5. ⬜ Run baseline evaluation
  6. ⬜ Measure baseline latency and metrics

Goal: End-to-end flow by Friday

Next Week (Week 1)

  1. ⬜ Implement document loaders (PDF, DOCX, XLSX)
  2. ⬜ S3 integration
  3. ⬜ Expand test fixtures to 30 queries
  4. ⬜ Optimize baseline metrics

Goal: Full document loading pipeline


📞 Key Contacts & Review

Technical Review: Backend team, ML team
Business Review: Product manager
Security Review: Security engineer
Final Approval: CTO

Next Review: End of Week 1 (baseline metrics)


This refined plan incorporates the production best practices above while staying focused on delivering value through user stories. Ready to start implementation! 🚀