Implement Caching for Performance & Cost Savings

Difficulty: ⭐⭐ Intermediate | Time: 1 hour

🎯 The Problem

You're paying for expensive LLM calls on repeated questions, wasting money and time: the same query from different users triggers the same API call and the same cost. Responses take 2-3s when cached queries could return almost instantly.

This guide solves: Implementing multi-layer caching (response cache, embedding cache, semantic cache) to reduce costs by 30-50% and improve response times by 10x for repeated queries.

⚡ TL;DR - Quick Caching

from packages.caching import ResponseCache, EmbeddingCache

# 1. Enable response caching
response_cache = ResponseCache(
    backend="redis",
    redis_url="redis://localhost:6379",
    ttl=3600,  # Cache for 1 hour
)

# 2. Enable embedding caching
embedding_cache = EmbeddingCache(ttl=86400)  # Cache for 24 hours

# 3. Use with agent
@app.post("/api/query")
async def query(request: QueryRequest):
    # Check cache first
    cached = await response_cache.get(request.query)
    if cached:
        return cached  # Instant response! (10ms vs 2000ms)

    # Generate fresh response
    result = await agent.run(request.query)

    # Cache for next time
    await response_cache.set(request.query, result)
    return result

# Expected savings: 30-50% cost reduction, 10x faster for cache hits

Example impact: $15,000/month → $8,000/month in LLM costs (~47% reduction)!


Full Caching Guide

Caching Strategy Overview

Cache Layer 1: Response Cache

Purpose: Cache exact query-response pairs

Hit Rate: 20-30% for common queries
Savings: $0.025 → $0 per cached query
Speed: 2000ms → 10ms (200x faster)

Implementation

import hashlib
import json

class ResponseCache:
    def __init__(self, redis_client, ttl=3600):
        self.redis = redis_client
        self.ttl = ttl

    def _make_key(self, query: str, user_context: dict = None) -> str:
        """Create cache key"""
        # Include user context for personalized caching
        domain = (user_context or {}).get("domain", "general")
        cache_input = f"{query}_{domain}"
        return f"response:{hashlib.md5(cache_input.encode()).hexdigest()}"

    async def get(self, query: str, user_context: dict = None):
        """Get cached response, or None on a miss"""
        key = self._make_key(query, user_context)
        cached = await self.redis.get(key)

        if cached:
            return json.loads(cached)
        return None

    async def set(self, query: str, response: dict, user_context: dict = None):
        """Cache response for self.ttl seconds"""
        key = self._make_key(query, user_context)
        await self.redis.setex(
            key,
            self.ttl,
            json.dumps(response),
        )
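
A minimal usage sketch, assuming a redis.asyncio client and the same agent object used in the TL;DR above:

import redis.asyncio as redis

redis_client = redis.from_url("redis://localhost:6379")
response_cache = ResponseCache(redis_client, ttl=3600)

async def answer(query: str, user_context: dict = None) -> dict:
    # Exact-match lookup first; hits return in ~10ms
    cached = await response_cache.get(query, user_context)
    if cached:
        return cached

    # Miss: generate a fresh response and cache it for the next hour
    result = await agent.run(query)
    await response_cache.set(query, result, user_context)
    return result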

Cache Layer 2: Semantic Cache

Purpose: Return cached responses for similar queries

Hit Rate: 40-50% when combined with exact cache
Savings: avoids the full LLM call on a hit; only an embedding lookup and one vector search remain
Speed: ~50ms (40x faster)

Implementation

import uuid

class SemanticCache:
    def __init__(self, vector_store, storage, similarity_threshold=0.95):
        self.vector_store = vector_store
        self.storage = storage  # key-value store (e.g. Redis) holding cached responses
        self.similarity_threshold = similarity_threshold

    async def get(self, query: str):
        """Find a semantically similar cached query and return its response"""
        # Search for similar previously answered queries
        similar = await self.vector_store.search(
            query=query,
            index="cached_queries",
            k=1,
        )

        if similar and similar[0].score >= self.similarity_threshold:
            # Found a similar enough query; fetch its stored response
            cached_query_id = similar[0].metadata["query_id"]
            return await self.storage.get(f"response:{cached_query_id}")

        return None

    async def set(self, query: str, response: dict):
        """Cache the query and its response for future semantic lookups"""
        query_id = str(uuid.uuid4())

        # Store response
        await self.storage.set(f"response:{query_id}", response)

        # Index query for semantic search
        await self.vector_store.add_document(
            id=query_id,
            content=query,
            metadata={"query_id": query_id, "type": "cached_query"},
        )
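
On the read path, the exact cache and the semantic cache are checked in order of cost. A sketch, assuming the response_cache and semantic_cache instances defined above:

async def cached_lookup(query: str):
    # 1. Exact match first (a single Redis GET, ~10ms)
    hit = await response_cache.get(query)
    if hit:
        return hit

    # 2. Fall back to a semantic match (~50ms: one embedding + one vector search)
    return await semantic_cache.get(query)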

Cache Layer 3: Embedding Cache

Purpose: Cache expensive embedding computations

Hit Rate: 60-70% for repeated documents
Savings: $0.005 per cached embedding
Speed: 200ms → 5ms (40x faster)

Implementation

import hashlib
import json

class EmbeddingCache:
    def __init__(self, redis_client, ttl=86400):  # 24 hours
        self.redis = redis_client
        self.ttl = ttl

    async def get_embedding(self, text: str, model: str):
        """Return a cached embedding, generating and caching it on a miss"""
        key = f"emb:{model}:{hashlib.md5(text.encode()).hexdigest()}"
        cached = await self.redis.get(key)

        if cached:
            return json.loads(cached)

        # Generate fresh embedding (generate_embedding is your embedding client call)
        embedding = await generate_embedding(text, model)

        # Cache it
        await self.redis.setex(key, self.ttl, json.dumps(embedding))

        return embedding
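
A usage sketch for document ingestion; the model name is an illustrative placeholder:

embedding_cache = EmbeddingCache(redis_client, ttl=86400)

async def embed_documents(texts: list, model: str = "text-embedding-3-small"):
    # Repeated chunks (shared headers, boilerplate paragraphs) are served from Redis
    return [await embedding_cache.get_embedding(text, model) for text in texts]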

Complete Caching Setup

from packages.caching import MultiLayerCache
import redis.asyncio as redis

# Initialize Redis (the client connects lazily on first command)
redis_client = redis.from_url("redis://localhost:6379")

# Set up multi-layer caching
cache = MultiLayerCache(
    redis_client=redis_client,
    response_ttl=3600,        # 1 hour for responses
    embedding_ttl=86400,      # 24 hours for embeddings
    semantic_threshold=0.95,  # 95% similarity for semantic cache
)

# Use in your API
@app.post("/api/query")
async def query(request: QueryRequest):
    # Try all cache layers
    cached_response = await cache.get(request.query)
    if cached_response:
        return {
            **cached_response,
            "cached": True,
            "latency_ms": cached_response.get("cache_latency", 10),
        }

    # Generate fresh response
    result = await agent.run(request.query)

    # Cache for future requests
    await cache.set(request.query, result)

    return result
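
MultiLayerCache comes from packages.caching, so its internals aren't shown here, but a reasonable sketch of how the layers compose looks like this (an assumption, not the library source):

class MultiLayerCacheSketch:
    def __init__(self, response_cache, semantic_cache):
        self.response_cache = response_cache
        self.semantic_cache = semantic_cache

    async def get(self, query: str):
        # Cheapest layer first: exact-match response cache (~10ms)
        hit = await self.response_cache.get(query)
        if hit:
            return hit
        # Then the semantic cache (~50ms)
        return await self.semantic_cache.get(query)

    async def set(self, query: str, response: dict):
        # Write through to both layers so future lookups can hit either
        await self.response_cache.set(query, response)
        await self.semantic_cache.set(query, response)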

Cache Invalidation

Time-Based Invalidation

# Automatic TTL expiration
await cache.set(query, response, ttl=3600)  # Expires in 1 hour

Manual Invalidation

# Invalidate when content updates
async def update_knowledge_base(new_docs):
    """Clear the response cache when the knowledge base updates"""
    await vector_store.add_documents(new_docs)

    # Clear response cache (knowledge changed)
    await response_cache.clear()

    # Keep embedding cache (embeddings still valid)
    print("✅ Knowledge base updated, response cache cleared")

Selective Invalidation

# Invalidate specific patterns
await cache.delete_pattern("response:*medical*") # Clear medical responses
await cache.delete_pattern("response:*outdated_product*") # Clear specific products
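
Pattern deletion isn't a single Redis command; delete_pattern can be implemented with a non-blocking SCAN, for example:

async def delete_pattern(redis_client, pattern: str) -> int:
    """Delete all keys matching a glob pattern without blocking Redis (SCAN, not KEYS)."""
    deleted = 0
    async for key in redis_client.scan_iter(match=pattern, count=500):
        await redis_client.delete(key)
        deleted += 1
    return deleted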

Monitoring Cache Performance

# Cache metrics
metrics = await cache.get_stats()

print(f"""
Cache Performance:
  Hit Rate: {metrics['hit_rate']:.1%} (target: >40%)
  Total Hits: {metrics['hits']:,}
  Total Misses: {metrics['misses']:,}
  Avg Hit Time: {metrics['avg_hit_latency']}ms
  Cost Saved: ${metrics['cost_saved']:.2f}

  Response Cache: {metrics['response_hit_rate']:.1%} hit rate
  Semantic Cache: {metrics['semantic_hit_rate']:.1%} hit rate
  Embedding Cache: {metrics['embedding_hit_rate']:.1%} hit rate
""")

Cost Impact Analysis

Before Caching

10,000 queries/day × $0.025/query = $250/day = $7,500/month
Average latency: 2000ms

After Caching (40% hit rate)

6,000 fresh queries × $0.025 = $150/day
4,000 cached queries × $0 = $0/day
Total: $150/day = $4,500/month

Savings: $3,000/month (40%)
Average latency: 1204ms (~40% faster)
- 40% at 10ms (cached)
- 60% at 2000ms (fresh)
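
The same arithmetic as a quick sanity check:

queries_per_day = 10_000
cost_per_query = 0.025   # $/query
hit_rate = 0.40

fresh_queries = queries_per_day * (1 - hit_rate)                        # 6,000/day
daily_cost = fresh_queries * cost_per_query                             # $150/day
monthly_cost = daily_cost * 30                                          # $4,500/month
monthly_savings = queries_per_day * cost_per_query * 30 - monthly_cost  # $3,000/month

blended_latency = hit_rate * 10 + (1 - hit_rate) * 2000                 # 1204ms vs 2000ms baseline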

Best Practices

| Practice | Why | Implementation |
| --- | --- | --- |
| Cache by user tier | Different TTLs for different users | Free: 10 min, Pro: 1 hour, Enterprise: 24 hours |
| Monitor hit rates | Optimize cache strategy | Alert if hit rate < 30% |
| Warm cache | Preload common queries | Background job for FAQs |
| Set max cache size | Prevent memory issues | LRU eviction at 10GB |
| Use compression | Store more in the same space | gzip responses |
| Version cache keys | Invalidate on code changes | Include version in key |
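
A sketch combining two of these practices, versioned cache keys and per-tier TTLs (key prefix and TTL values are illustrative):

import hashlib
import json

CACHE_KEY_VERSION = "v3"  # bump when prompts, models, or response schema change
TTL_BY_TIER = {"free": 600, "pro": 3600, "enterprise": 86400}  # seconds

def make_key(query: str, tier: str) -> str:
    digest = hashlib.md5(query.encode()).hexdigest()
    return f"response:{CACHE_KEY_VERSION}:{tier}:{digest}"

async def cache_response(redis_client, query: str, tier: str, response: dict):
    ttl = TTL_BY_TIER.get(tier, 600)
    await redis_client.setex(make_key(query, tier), ttl, json.dumps(response))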

Troubleshooting

| Problem | Cause | Solution |
| --- | --- | --- |
| Low hit rate (<20%) | Queries too unique | Implement semantic cache |
| Stale responses | TTL too long | Reduce TTL or manual invalidation |
| High memory usage | Cache too large | Set max size, use LRU eviction |
| Cache thrashing | Eviction too frequent | Increase cache size |
| Slow cache lookups | Network latency | Use local Redis or connection pooling |

What You've Accomplished

Implemented multi-layer caching (response, semantic, embedding)
Reduced costs by 30-50% through caching
Improved latency 10x for cache hits
Set up cache monitoring and metrics
Configured intelligent cache invalidation

Next Steps