# Implement Caching for Performance & Cost Savings

**Difficulty:** ⭐⭐ Intermediate | **Time:** 1 hour

## 🎯 The Problem

You're making expensive LLM calls for repeated questions, wasting money and time. The same query from different users triggers the same API call, which is pure unnecessary cost, and response times sit at 2-3 seconds when they could be near-instant for cached queries.

**This guide solves:** implementing multi-layer caching (response cache, embedding cache, semantic cache) to reduce costs by 30-50% and make repeated queries roughly 10x faster.
## ⚡ TL;DR - Quick Caching

```python
from packages.caching import ResponseCache, EmbeddingCache

# 1. Enable response caching
response_cache = ResponseCache(
    backend="redis",
    redis_url="redis://localhost:6379",
    ttl=3600  # Cache for 1 hour
)

# 2. Enable embedding caching
embedding_cache = EmbeddingCache(ttl=86400)  # Cache for 24 hours

# 3. Use with agent
@app.post("/api/query")
async def query(request: QueryRequest):
    # Check cache first
    cached = await response_cache.get(request.query)
    if cached:
        return cached  # Instant response! (10ms vs 2000ms)

    # Generate fresh response
    result = await agent.run(request.query)

    # Cache for next time
    await response_cache.set(request.query, result)
    return result

# Expected savings: 30-50% cost reduction, 10x faster for cache hits
```
**Impact:** 30-50% lower LLM spend; in the worked example below, $7,500/month drops to $4,500/month.
## Full Caching Guide

### Caching Strategy Overview

The strategy stacks three cache layers: an exact-match response cache, a semantic cache for near-duplicate queries, and an embedding cache for repeated documents. Each layer is detailed below.

### Cache Layer 1: Response Cache

- **Purpose:** Cache exact query-response pairs
- **Hit Rate:** 20-30% for common queries
- **Savings:** $0.025 → $0 per cached query
- **Speed:** 2000ms → 10ms (200x faster)
#### Implementation

```python
import hashlib
import json


class ResponseCache:
    def __init__(self, redis_client, ttl=3600):
        self.redis = redis_client
        self.ttl = ttl

    def _make_key(self, query: str, user_context: dict = None) -> str:
        """Create cache key"""
        # Include user context for personalized caching
        domain = (user_context or {}).get("domain", "general")
        cache_input = f"{query}_{domain}"
        return f"response:{hashlib.md5(cache_input.encode()).hexdigest()}"

    async def get(self, query: str, user_context: dict = None):
        """Get cached response"""
        key = self._make_key(query, user_context)
        cached = await self.redis.get(key)
        if cached:
            return json.loads(cached)
        return None

    async def set(self, query: str, response: dict, user_context: dict = None):
        """Cache response"""
        key = self._make_key(query, user_context)
        await self.redis.setex(
            key,
            self.ttl,
            json.dumps(response)
        )
```
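Passing the user context through means users in different domains get separately cached answers. A minimal usage sketch, assuming a FastAPI endpoint and `agent` object like the ones used elsewhere in this guide (the `request.domain` field is hypothetical):

```python
@app.post("/api/query")
async def query(request: QueryRequest):
    user_context = {"domain": request.domain}  # hypothetical request field

    # Exact-match lookup, keyed by query + domain
    cached = await response_cache.get(request.query, user_context)
    if cached:
        return cached

    result = await agent.run(request.query)
    await response_cache.set(request.query, result, user_context)
    return result
```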
### Cache Layer 2: Semantic Cache

- **Purpose:** Return cached responses for similar (not just identical) queries
- **Hit Rate:** 40-50% when combined with the exact-match cache
- **Savings:** Avoids the $0.025 LLM call; only a vector lookup runs
- **Speed:** ~50ms (40x faster)
#### Implementation

```python
import json
import uuid


class SemanticCache:
    def __init__(self, vector_store, storage, similarity_threshold=0.95):
        self.vector_store = vector_store
        self.storage = storage  # key-value store (e.g. Redis) for cached responses
        self.similarity_threshold = similarity_threshold

    async def get(self, query: str):
        """Find a semantically similar cached query"""
        # Search for similar queries
        similar = await self.vector_store.search(
            query=query,
            index="cached_queries",
            k=1
        )
        if similar and similar[0].score >= self.similarity_threshold:
            # Found a similar enough query
            cached_query_id = similar[0].metadata["query_id"]
            return await self.get_response_by_id(cached_query_id)
        return None

    async def get_response_by_id(self, query_id: str):
        """Load the stored response for a cached query id"""
        cached = await self.storage.get(f"response:{query_id}")
        return json.loads(cached) if cached else None

    async def set(self, query: str, response: dict):
        """Cache query and response"""
        query_id = str(uuid.uuid4())

        # Store the response payload
        await self.storage.set(f"response:{query_id}", json.dumps(response))

        # Index the query text for semantic search
        await self.vector_store.add_document(
            id=query_id,
            content=query,
            metadata={"query_id": query_id, "type": "cached_query"}
        )
```
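A rough sketch of how the two layers chain at query time, assuming the `ResponseCache` and `SemanticCache` instances above; the packaged `MultiLayerCache` shown later bundles this logic, so treat these internals as an assumption:

```python
async def cached_lookup(query: str, user_context: dict = None):
    """Try the exact-match cache first, then fall back to the semantic cache."""
    # Layer 1: exact match (cheapest, ~10ms)
    exact = await response_cache.get(query, user_context)
    if exact:
        return exact

    # Layer 2: near-duplicate match (~50ms, still avoids the LLM call)
    similar = await semantic_cache.get(query)
    if similar:
        return similar

    return None  # cache miss; the caller generates a fresh response
```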
### Cache Layer 3: Embedding Cache

- **Purpose:** Cache expensive embedding computations
- **Hit Rate:** 60-70% for repeated documents
- **Savings:** $0.005 per cached embedding
- **Speed:** 200ms → 5ms (40x faster)
#### Implementation

```python
import hashlib
import json


class EmbeddingCache:
    def __init__(self, redis_client, embed_fn, ttl=86400):  # 24 hours
        self.redis = redis_client
        self.embed_fn = embed_fn  # async callable: (text, model) -> list[float]
        self.ttl = ttl

    async def get_embedding(self, text: str, model: str):
        """Return a cached embedding, computing and caching it on a miss"""
        key = f"emb:{model}:{hashlib.md5(text.encode()).hexdigest()}"
        cached = await self.redis.get(key)
        if cached:
            return json.loads(cached)

        # Generate fresh embedding
        embedding = await self.embed_fn(text, model)

        # Cache it
        await self.redis.setex(key, self.ttl, json.dumps(embedding))
        return embedding
```
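A usage sketch wiring the cache to an embedding provider; the OpenAI client here is only an example of what `embed_fn` could be, not something this guide's packages require:

```python
from openai import AsyncOpenAI

openai_client = AsyncOpenAI()

async def embed_with_openai(text: str, model: str) -> list[float]:
    """Assumed embedding provider call; swap in whatever client you actually use."""
    response = await openai_client.embeddings.create(model=model, input=text)
    return response.data[0].embedding

embedding_cache = EmbeddingCache(redis_client, embed_fn=embed_with_openai, ttl=86400)

# Repeated texts hit Redis instead of the embeddings API
vector = await embedding_cache.get_embedding(
    "What is our refund policy?", "text-embedding-3-small"
)
```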
### Complete Caching Setup

```python
from packages.caching import MultiLayerCache
import redis.asyncio as redis

# Initialize Redis (redis.asyncio.from_url returns a client directly; no await needed)
redis_client = redis.from_url("redis://localhost:6379")

# Set up multi-layer caching
cache = MultiLayerCache(
    redis_client=redis_client,
    response_ttl=3600,       # 1 hour for responses
    embedding_ttl=86400,     # 24 hours for embeddings
    semantic_threshold=0.95  # 95% similarity for semantic cache
)

# Use in your API
@app.post("/api/query")
async def query(request: QueryRequest):
    # Try all cache layers
    cached_response = await cache.get(request.query)
    if cached_response:
        return {
            **cached_response,
            "cached": True,
            "latency_ms": cached_response.get("cache_latency", 10)
        }

    # Generate fresh response
    result = await agent.run(request.query)

    # Cache for future requests
    await cache.set(request.query, result)
    return result
```
## Cache Invalidation

### Time-Based Invalidation

```python
# Automatic TTL expiration; a per-call ttl overrides the default
await cache.set(query, response, ttl=3600)  # Expires in 1 hour
```
### Manual Invalidation

```python
# Invalidate when content updates
async def update_knowledge_base(new_docs):
    """Clear cache when KB updates"""
    await vector_store.add_documents(new_docs)

    # Clear response cache (knowledge changed)
    await response_cache.clear()

    # Keep embedding cache (embeddings still valid)
    print("✅ Knowledge base updated, response cache cleared")
```
### Selective Invalidation

Pattern-based deletion only works if the cache key contains a readable tag (for example the user's domain) alongside the hash; a key that is just an MD5 digest cannot be matched by content.

```python
# Invalidate specific patterns
await cache.delete_pattern("response:*medical*")           # Clear medical responses
await cache.delete_pattern("response:*outdated_product*")  # Clear specific products
```
## Monitoring Cache Performance

```python
# Cache metrics
metrics = await cache.get_stats()

print(f"""
Cache Performance:
  Hit Rate: {metrics['hit_rate']:.1%} (target: >40%)
  Total Hits: {metrics['hits']:,}
  Total Misses: {metrics['misses']:,}
  Avg Hit Time: {metrics['avg_hit_latency']}ms
  Cost Saved: ${metrics['cost_saved']:.2f}

  Response Cache: {metrics['response_hit_rate']:.1%} hit rate
  Semantic Cache: {metrics['semantic_hit_rate']:.1%} hit rate
  Embedding Cache: {metrics['embedding_hit_rate']:.1%} hit rate
""")
```
## Cost Impact Analysis

### Before Caching

- 10,000 queries/day × $0.025/query = $250/day = $7,500/month
- Average latency: 2000ms

### After Caching (40% hit rate)

- 6,000 fresh queries × $0.025 = $150/day
- 4,000 cached queries × $0 = $0/day
- Total: $150/day = $4,500/month
- **Savings: $3,000/month (40%)**
- Average latency: 1204ms (≈40% faster)
  - 40% at 10ms (cached)
  - 60% at 2000ms (fresh)
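The same arithmetic as a small helper, so you can plug in your own hit rate and per-query cost (the function name is just for illustration):

```python
def caching_impact(queries_per_day: int, cost_per_query: float,
                   fresh_latency_ms: float, cached_latency_ms: float,
                   hit_rate: float) -> dict:
    """Blend cost and latency across cache hits and misses."""
    fresh_queries = queries_per_day * (1 - hit_rate)
    daily_cost = fresh_queries * cost_per_query
    avg_latency = hit_rate * cached_latency_ms + (1 - hit_rate) * fresh_latency_ms
    return {"monthly_cost": daily_cost * 30, "avg_latency_ms": avg_latency}

# Numbers from the analysis above: $4,500/month and 1204ms at a 40% hit rate
print(caching_impact(10_000, 0.025, 2000, 10, 0.40))
```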
## Best Practices

| Practice | Why | Implementation |
|---|---|---|
| Cache by user tier | Different TTLs for different users | Free: 10 min, Pro: 1 hour, Enterprise: 24 hours |
| Monitor hit rates | Optimize cache strategy | Alert if hit rate < 30% |
| Warm cache | Preload common queries | Background job for FAQ |
| Set max cache size | Prevent memory issues | LRU eviction at 10GB |
| Use compression | Store more in same space | gzip responses |
| Version cache keys | Invalidate on code changes | Include version in key (see the sketch below) |
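A minimal sketch combining the "cache by user tier" and "version cache keys" rows above: tier-aware TTLs plus versioned, pattern-matchable keys. The version string, TTL table, and helper names are assumptions:

```python
import hashlib

CACHE_KEY_VERSION = "v3"  # bump to invalidate all old entries after prompt/model/code changes

TIER_TTLS = {"free": 600, "pro": 3600, "enterprise": 86400}  # seconds

def make_cache_key(query: str, domain: str, tier: str) -> str:
    digest = hashlib.md5(f"{query}_{domain}".encode()).hexdigest()
    # Readable prefix (version, domain, tier) keeps keys pattern-matchable; the hash keeps them short
    return f"response:{CACHE_KEY_VERSION}:{domain}:{tier}:{digest}"

def ttl_for(tier: str) -> int:
    return TIER_TTLS.get(tier, TIER_TTLS["free"])
```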
## Troubleshooting

| Problem | Cause | Solution |
|---|---|---|
| Low hit rate (<20%) | Queries too unique | Implement semantic cache |
| Stale responses | TTL too long | Reduce TTL or invalidate manually |
| High memory usage | Cache too large | Set max size, use LRU eviction |
| Cache thrashing | Eviction too frequent | Increase cache size |
| Slow cache lookups | Network latency | Use local Redis or connection pooling |
## What You've Accomplished

- ✅ Implemented multi-layer caching (response, semantic, embedding)
- ✅ Reduced costs by 30-50% through caching
- ✅ Improved latency 10x for cache hits
- ✅ Set up cache monitoring and metrics
- ✅ Configured intelligent cache invalidation

## Next Steps
- 💰 Cost Optimization - More cost-saving strategies
- 🚀 Deploy to Production - Deploy with caching
- 📊 Monitor Performance - Track cache effectiveness