API Integration & Production Features
Week 2 enhancements: REST API, caching, monitoring, and existing component integration
Overview
Week 2 adds production-ready features on top of the Week 0-1 foundation:
✅ REST API - FastAPI endpoints for all operations
✅ Multi-Level Caching - 40-60% cost reduction
✅ Query Expansion - Integrated from existing query_expansion.py
✅ Faceted Search - Integrated from existing faceted_search.py
✅ Metrics & Monitoring - Prometheus integration
Key Principle: Reuse existing components, add orchestration layer
REST API Endpoints
API Design Philosophy
RESTful Design: Resources (documents) with actions (search, summarize)
Base Path: /api/v1/documents
Endpoint 1: Upload Document
Purpose: Upload and automatically index a document
POST /api/v1/documents/upload
Content-Type: multipart/form-data
Parameters:
- file: File (required) - PDF, DOCX, XLSX, etc.
- metadata: JSON (optional) - Additional metadata
Response:
{
"document_id": "a1b2c3d4",
"filename": "report.pdf",
"format": "pdf",
"size_bytes": 2048576,
"status": "indexed",
"chunks_created": 45,
"s3_key": "documents/2025/10/a1b2c3d4/report.pdf",
"processing_time_ms": 8500
}
What Happens Behind the Scenes:
Upload → Save temp file → DocumentLoader (REUSED) →
S3 upload (NEW) → SemanticChunker (REUSED) →
Generate embeddings → VectorStore.add_documents (REUSED)
Example:
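import json
import requests

metadata = {'category': 'finance', 'year': 2025}

# Open the file in a context manager so the handle is closed after the upload
with open('report.pdf', 'rb') as f:
    response = requests.post(
        'http://localhost:8000/api/v1/documents/upload',
        files={'file': f},
        # metadata is sent as a JSON-encoded form field
        data={'metadata': json.dumps(metadata)}
    )

response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error body
print(response.json()['document_id'])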
Endpoint 2: Search Documents
Purpose: Hybrid search with optional summarization
POST /api/v1/documents/search
Content-Type: application/json
Body:
{
"query": "revenue growth Q3",
"profile": "balanced",
"filters": {"category": "finance"},
"limit": 10,
"include_summary": true
}
Response:
{
"query": "revenue growth Q3",
"total_results": 45,
"results": [
{
"chunk_id": "a1b2c3d4_chunk_0",
"document_id": "a1b2c3d4",
"content": "Q3 revenue increased by 25%...",
"score": 0.89,
"metadata": {
"document_title": "Q3 Report",
"page_number": 5
}
}
],
"summary": {
"text": "Revenue grew 25% in Q3 [1]. Driven by new markets [2].",
"citations": {
"1": {"document_title": "Q3 Report", "snippet": "..."},
"2": {"document_title": "Strategy Doc", "snippet": "..."}
},
"faithfulness": 0.92,
"coverage": 0.85
},
"timing": {
"retrieval_ms": 150,
"reranking_ms": 80,
"summarization_ms": 120,
"total_ms": 350
},
"profile": "balanced",
"slo_met": true
}
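The summary text refers to its sources with bracketed markers such as [1] that index into the citations object. A minimal client-side sketch of resolving those markers into a source list follows; the helper name render_citations is hypothetical and not part of the API, and the input simply mirrors the response shape shown above.

```python
import re

def render_citations(summary: dict) -> str:
    """Return the summary text followed by a source line for each [n] marker."""
    text = summary["text"]
    citations = summary.get("citations", {})

    # Collect markers in order of first appearance, skipping repeats.
    seen = []
    for marker in re.findall(r"\[(\d+)\]", text):
        if marker not in seen:
            seen.append(marker)

    lines = [text, "", "Sources:"]
    for marker in seen:
        entry = citations.get(marker)
        # A marker without a matching citation is flagged rather than dropped.
        title = entry["document_title"] if entry else "(missing citation)"
        lines.append(f"[{marker}] {title}")
    return "\n".join(lines)

# Sample input copied from the "summary" field of the response above.
summary = {
    "text": "Revenue grew 25% in Q3 [1]. Driven by new markets [2].",
    "citations": {
        "1": {"document_title": "Q3 Report", "snippet": "..."},
        "2": {"document_title": "Strategy Doc", "snippet": "..."},
    },
}
print(render_citations(summary))
```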
What Happens:
Query → [Cache Check] → HybridRetriever (REUSED) →
CrossEncoderReranker (REUSED) → GroundedSummarizer (NEW) →
[Cache Store] → Response
Cache Hit:
- L1 cache hit: ~5ms (99% faster!)
- L2 cache hit (retrieval): ~125ms (skips retrieval and reranking; only summarization runs)
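The lookup order can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: plain dicts stand in for the cache backend, TTLs are omitted, the cache_key and search helpers are hypothetical, and retrieve and summarize are caller-supplied stand-ins for the retrieval and summarization stages. Key fields follow the composition given in the cache architecture section (L1: query + filters + profile, L2: query + topK).

```python
import hashlib
import json

def cache_key(prefix: str, **parts) -> str:
    """Build a deterministic key: identical inputs always produce the same key."""
    canonical = json.dumps(parts, sort_keys=True)
    return f"{prefix}:{hashlib.sha256(canonical.encode()).hexdigest()[:16]}"

# In-memory stand-ins for the cache backend (assumption for this sketch).
l1_results: dict = {}
l2_retrieval: dict = {}

def search(query, filters, profile, top_k, retrieve, summarize):
    """Return (result, source) where source is 'l1_hit', 'l2_hit', or 'miss'."""
    # L1: the full result is cached, so nothing else needs to run.
    k1 = cache_key("l1", query=query, filters=filters, profile=profile)
    if k1 in l1_results:
        return l1_results[k1], "l1_hit"

    # L2: retrieved chunks are cached, so only summarization runs.
    k2 = cache_key("l2", query=query, top_k=top_k)
    if k2 in l2_retrieval:
        chunks, source = l2_retrieval[k2], "l2_hit"
    else:
        chunks, source = retrieve(query, top_k), "miss"
        l2_retrieval[k2] = chunks

    result = summarize(query, chunks)
    l1_results[k1] = result  # store for the next identical request
    return result, source
```

Serializing the key parts with sorted keys keeps the hash stable regardless of dict ordering in filters. If filters influence which chunks are retrieved, they would also need to be part of the L2 key.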
Endpoint 3: Summarize Document
Purpose: Generate summary for specific document
POST /api/v1/documents/{document_id}/summarize
Body:
{
"query": "What were the key findings?",
"mode": "extractive",
"max_length": 250
}
Modes:
- extractive: Fast (50-200ms), faithful (100%), free
- abstractive: Fluent, comprehensive, LLM-based ($0.01-0.10)
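To illustrate why extractive mode can be fully faithful: it only selects sentences that already exist in the document, so nothing is paraphrased or invented. The sketch below is a toy version based on query-word overlap, not the GroundedSummarizer implementation; the function name is hypothetical, and max_length is assumed here to be a character budget, since the source does not state its unit.

```python
import re

def extractive_summary(text: str, query: str, max_length: int = 250) -> str:
    """Select the sentences that share the most words with the query, verbatim."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    query_terms = set(re.findall(r"\w+", query.lower()))

    def overlap(sentence: str) -> int:
        return len(query_terms & set(re.findall(r"\w+", sentence.lower())))

    # Rank by overlap; sorted() is stable, so ties keep document order.
    ranked = sorted(range(len(sentences)), key=lambda i: -overlap(sentences[i]))

    chosen, used = [], 0
    for i in ranked:
        cost = len(sentences[i]) + (1 if chosen else 0)  # +1 for the joining space
        if overlap(sentences[i]) == 0 or used + cost > max_length:
            continue
        chosen.append(i)
        used += cost

    # Emit in original document order so the summary reads naturally.
    return " ".join(sentences[i] for i in sorted(chosen))

text = ("Revenue grew 25% in Q3. The office moved to Berlin. "
        "Key findings show growth in new markets.")
print(extractive_summary(text, "What were the key findings?", max_length=50))
```

Abstractive mode trades this guarantee for fluency, because an LLM rewrites the content, which is also where the per-request cost comes from.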
Endpoint 4: Find Similar Documents
Purpose: Content-based recommendations
GET /api/v1/documents/{document_id}/similar?limit=5
Use Cases:
- "Related documents" feature
- Content recommendations
- Duplicate detection
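The source does not say how similarity is computed. One common approach, assumed here purely for illustration, is to compare one embedding vector per document by cosine similarity. The find_similar helper and the embeddings mapping are hypothetical.

```python
import math

def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def find_similar(document_id, embeddings, limit=5):
    """Rank every other document by similarity to the given one, best first."""
    target = embeddings[document_id]
    scored = [
        (other_id, cosine(target, vector))
        for other_id, vector in embeddings.items()
        if other_id != document_id  # a document is not its own recommendation
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]

# Toy 2-dimensional embeddings; real ones have hundreds of dimensions.
embeddings = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
print(find_similar("a", embeddings, limit=2))
```

For duplicate detection, the same scores can be thresholded: pairs scoring very close to 1.0 are candidates for being near-identical.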
Endpoint 5: Delete Document
DELETE /api/v1/documents/{document_id}
What Gets Deleted:
- Chunks from vector store
- Raw file from S3
- Metadata from MongoDB
- Related caches
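The cascade above might look like the following sketch. Plain dicts stand in for the vector store, S3, MongoDB, and the cache, and all names are hypothetical. It assumes chunk IDs follow the "<document_id>_chunk_<n>" pattern seen in the search response, that the metadata record holds the s3_key, and that each cache entry records the document IDs it references.

```python
def delete_document(document_id, chunks, raw_files, metadata, cache):
    """Remove a document from every store and report what was removed."""
    # 1. Chunks in the vector store.
    prefix = f"{document_id}_chunk_"
    chunk_ids = [cid for cid in chunks if cid.startswith(prefix)]
    for cid in chunk_ids:
        del chunks[cid]

    # 2. Raw file: read the S3 key from the metadata record before deleting it.
    record = metadata.get(document_id)
    raw_removed = (
        record is not None
        and raw_files.pop(record["s3_key"], None) is not None
    )

    # 3. Metadata record.
    metadata.pop(document_id, None)

    # 4. Cached results that reference the document (assumed bookkeeping).
    stale = [key for key, entry in cache.items()
             if document_id in entry["document_ids"]]
    for key in stale:
        del cache[key]

    return {"chunks": len(chunk_ids), "raw_file": raw_removed,
            "cache_entries": len(stale)}
```

Reading the metadata record before removing it matters: once the record is gone, the S3 key needed to locate the raw file would be lost.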
Multi-Level Caching
Caching Theory
Fundamental Principle: Avoid recomputing what hasn't changed.
The 80/20 Rule in Search:
- 20% of queries account for 80% of traffic
- These hot queries should be cached
Cache Hit Benefits:
Without cache:
Query → Embed → Search (150ms) → Rerank (80ms) → Summarize (120ms)
= 350ms
With L1 cache hit:
Query → Redis lookup → Return cached result
= 5ms (70x faster!)
With L2 cache hit (retrieval cached):
Query → Redis lookup (retrieval) → Summarize (120ms)
= 125ms (2.8x faster)
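The three latencies above combine into an expected latency once hit rates are known. The sketch below only does that arithmetic; the hit rates passed in are hypothetical inputs, not measurements from this system, and l2_hit_rate is defined as the fraction of L1 misses that L2 serves.

```python
def expected_latency_ms(l1_hit_rate, l2_hit_rate,
                        l1_ms=5, l2_ms=125, miss_ms=350):
    """Average latency given cache hit rates.

    l1_hit_rate: fraction of all requests served by L1.
    l2_hit_rate: fraction of L1 misses served by L2.
    Default latencies are the figures quoted above.
    """
    l1_miss = 1 - l1_hit_rate
    after_l1 = l2_hit_rate * l2_ms + (1 - l2_hit_rate) * miss_ms
    return l1_hit_rate * l1_ms + l1_miss * after_l1

# No caching at all versus a hypothetical 80% L1 hit rate.
print(expected_latency_ms(0.0, 0.0))
print(expected_latency_ms(0.8, 0.0))
```

With no hits the result is the full 350 ms pipeline; if L1 hypothetically served 80% of requests, the average would drop to roughly 74 ms even with L2 contributing nothing.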
Four-Level Cache Architecture
Query arrives
↓
┌─────────────────────────────┐
│ L1: Full Result Cache │
│ Key: query + filters + │
│ profile │
│ TTL: 1 hour │
│ Hit: Return immediately (5ms)│
└──────────┬──────────────────┘
↓ Miss
┌─────────────────────────────┐
│ L2: Retrieval Cache │
│ Key: query + topK │
│ TTL: 1 hour │
│ Hit: Skip to summarize (125ms)│
└──────────┬──────────────────┘
↓ Miss
┌─────────────────────────────┐
│ L3: Summary Cache │