Cache Performance Tuning Guide
This guide provides comprehensive strategies for optimizing cache performance, including memory management, eviction policies, compression settings, and monitoring techniques.
Table of Contents
- Performance Metrics
- Memory Optimization
- Eviction Policy Tuning
- Compression Optimization
- Semantic Matching Tuning
- Cache Warming Optimization
- Distributed Cache Tuning
- Monitoring and Alerting
- Performance Testing
- Troubleshooting Performance Issues
Performance Metrics
Key Performance Indicators (KPIs)
Monitor these critical metrics to assess cache performance:
Hit Rate Metrics
- Overall Hit Rate: Percentage of requests served from cache
- Layer Hit Rates: Hit rates for each cache layer (embedding, search, LLM)
- Semantic Hit Rate: Percentage of cache hits served by semantic (approximate) matches rather than exact key matches
- User-Specific Hit Rate: Hit rates per user or user segment
Response Time Metrics
- Average Response Time: Mean time to serve cached content
- P95 Response Time: 95th percentile response time
- P99 Response Time: 99th percentile response time
- Cache Miss Penalty: Additional latency incurred by a cache miss compared with a cache hit
Memory Metrics
- Memory Usage: Current memory consumption
- Memory Efficiency: Ratio of useful data to total memory
- Fragmentation Ratio: Memory fragmentation level
- Eviction Rate: Frequency of cache evictions
Throughput Metrics
- Requests Per Second: Cache request throughput
- Operations Per Second: Cache operations throughput
- Warming Throughput: Cache warming operation rate
- Compression Ratio: Data compression effectiveness
Performance Targets
| Metric | Target | Warning | Critical |
|---|---|---|---|
| Overall Hit Rate | >85% | <80% | <70% |
| Average Response Time | <50ms | >100ms | >200ms |
| P95 Response Time | <100ms | >200ms | >500ms |
| Memory Usage | <80% | >90% | >95% |
| Eviction Rate | <5/min | >10/min | >20/min |
| Semantic Hit Rate | >60% | <50% | <40% |
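These targets can also be checked programmatically. The sketch below is a minimal, illustrative example; the field names on the `stats` object (`hit_rate`, `avg_response_time_ms`, and so on) are assumptions for illustration, not the actual attributes exposed by the caching package.

```python
from dataclasses import dataclass

@dataclass
class CacheHealthReport:
    metric: str
    value: float
    status: str  # "ok", "warning", or "critical"

def check_targets(stats) -> list[CacheHealthReport]:
    # (metric, observed value, warning threshold, critical threshold, higher_is_better)
    checks = [
        ("overall_hit_rate", stats.hit_rate, 0.80, 0.70, True),
        ("avg_response_time_ms", stats.avg_response_time_ms, 100, 200, False),
        ("p95_response_time_ms", stats.p95_response_time_ms, 200, 500, False),
        ("memory_usage_fraction", stats.memory_usage_fraction, 0.90, 0.95, False),
        ("evictions_per_minute", stats.evictions_per_minute, 10, 20, False),
        ("semantic_hit_rate", stats.semantic_hit_rate, 0.50, 0.40, True),
    ]
    reports = []
    for name, value, warn, crit, higher_is_better in checks:
        if higher_is_better:
            status = "critical" if value < crit else "warning" if value < warn else "ok"
        else:
            status = "critical" if value > crit else "warning" if value > warn else "ok"
        reports.append(CacheHealthReport(name, value, status))
    return reports
```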
Memory Optimization
Memory Allocation Strategy
from packages.caching.optimization import MemoryOptimizer, OptimizationConfig
# Configure memory optimization
memory_config = OptimizationConfig(
memory_limit_bytes=2 * 1024 * 1024 * 1024, # 2GB
memory_cleanup_interval=300, # 5 minutes
memory_pressure_threshold=0.8 # 80%
)
memory_optimizer = MemoryOptimizer(memory_config)
# Optimize cache entries
optimized_entries = memory_optimizer.optimize_memory_usage(cache_entries)
Memory Pressure Detection
# Check memory pressure
current_usage = (await cache_manager.get_stats(CacheLayer.EMBEDDING)).total_size_bytes
is_pressure = memory_optimizer.check_memory_pressure(current_usage)
if is_pressure:
# Trigger aggressive eviction
await cache_manager.clear_old_entries()
# Enable compression for new entries
config.compression_enabled = True
Memory Profiling
# Get detailed memory statistics
memory_stats = memory_optimizer.get_memory_stats()
print(f"Total allocated: {memory_stats['total_allocated'] / (1024*1024):.2f} MB")
print(f"Total freed: {memory_stats['total_freed'] / (1024*1024):.2f} MB")
print(f"Current usage: {memory_stats['current_usage'] / (1024*1024):.2f} MB")
print(f"Peak usage: {memory_stats['peak_usage'] / (1024*1024):.2f} MB")
print(f"Fragmentation: {memory_stats['fragmentation_ratio']:.2%}")
Eviction Policy Tuning
Policy Selection Guide
| Policy | Best For | Pros | Cons |
|---|---|---|---|
| LRU | General purpose, predictable access patterns | Simple, effective for temporal locality | Doesn't consider frequency or cost |
| LFU | Frequently accessed content, stable workloads | Good for popular content | Can get stuck with old popular content |
| TTL | Time-sensitive data, compliance requirements | Automatic expiration, predictable | May evict useful content |
| Hybrid | Complex workloads, balanced performance | Combines multiple factors | More complex to tune |
| Size-based | Memory-constrained environments | Maximizes memory efficiency | May evict important large items |
| Cost-based | Compute-intensive applications | Considers computation cost | Requires cost estimation |
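As a starting point, the table can be translated into per-layer policy choices. The snippet below is a sketch only: the `EvictionPolicy` members other than `HYBRID` and the single-argument `OptimizationConfig` calls are assumptions about the API, so adjust them to match the actual enum and config.

```python
from packages.caching.optimization import EvictionPolicy, OptimizationConfig

# Illustrative per-layer choices based on the table above (adjust to your workload):
# - Embedding layer: entries are expensive to recompute, so a cost-aware policy fits.
# - Search layer: results go stale quickly, so TTL-based expiration fits.
# - LLM layer: mixed access patterns, so the hybrid policy is a reasonable default.
embedding_config = OptimizationConfig(eviction_policy=EvictionPolicy.COST_BASED)
search_config = OptimizationConfig(eviction_policy=EvictionPolicy.TTL)
llm_config = OptimizationConfig(eviction_policy=EvictionPolicy.HYBRID)
```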
Hybrid Policy Tuning
from packages.caching.optimization import EvictionPolicy, EvictionPolicyManager
# Configure hybrid eviction with custom weights
eviction_config = OptimizationConfig(
eviction_policy=EvictionPolicy.HYBRID,
eviction_threshold=0.85, # Start evicting at 85%
eviction_batch_size=50 # Evict 50 entries at a time
)
eviction_manager = EvictionPolicyManager(eviction_config)
# Custom hybrid scoring (in the implementation)
# LRU factor: 30% weight
# LFU factor: 25% weight
# TTL factor: 20% weight
# Size factor: 15% weight
# Age factor: 10% weight
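Those weights can be read as a single weighted eviction score per entry. The function below is a minimal sketch of such a score, not the actual `EvictionPolicyManager` implementation; the entry fields it reads (`last_accessed`, `access_count`, `created_at`, `ttl_seconds`, `size_bytes`) are assumptions.

```python
import time

# Sketch only: higher score = better eviction candidate. Each factor is normalized to
# [0, 1] and weighted with the percentages listed above (30/25/20/15/10).
def hybrid_eviction_score(entry, max_size_bytes=1024 * 1024, now=None) -> float:
    now = now if now is not None else time.time()
    recency = min((now - entry.last_accessed) / 3600, 1.0)                     # LRU factor
    infrequency = 1.0 / (1.0 + entry.access_count)                             # LFU factor
    ttl_used = min((now - entry.created_at) / max(entry.ttl_seconds, 1), 1.0)  # TTL factor
    size = min(entry.size_bytes / max_size_bytes, 1.0)                         # size factor
    age = min((now - entry.created_at) / 86400, 1.0)                           # age factor
    return 0.30 * recency + 0.25 * infrequency + 0.20 * ttl_used + 0.15 * size + 0.10 * age
```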
Dynamic Eviction Tuning
# Adjust eviction threshold based on performance
def adjust_eviction_threshold(cache_stats):
hit_rate = cache_stats.hit_rate
memory_usage = cache_stats.total_size_bytes / config.max_size_bytes
if hit_rate < 0.8 and memory_usage > 0.7:
# Low hit rate, high memory usage - be more aggressive
return 0.7
elif hit_rate > 0.9 and memory_usage < 0.6:
# High hit rate, low memory usage - be less aggressive
return 0.9
else:
# Default threshold
return 0.8
# Apply dynamic threshold
current_threshold = adjust_eviction_threshold(stats)
config.eviction_threshold = current_threshold
Compression Optimization
Compression Algorithm Selection
| Algorithm | Compression Ratio | Speed | CPU Usage | Best For |
|---|---|---|---|---|
| GZIP | High | Medium | Medium | General purpose |
| ZLIB | High | Medium | Medium | Similar to GZIP |
| LZ4 | Medium | Very Fast | Low | High-throughput applications |
| None | N/A | Fastest | None | Small data, fast access |
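When in doubt, benchmark the candidates on your own payloads. The snippet below compares the standard-library gzip and zlib codecs on a synthetic payload; LZ4 needs the third-party `lz4` package and is shown only as a commented-out variant.

```python
import gzip
import time
import zlib

def benchmark_codec(name, compress, payload: bytes) -> None:
    start = time.perf_counter()
    compressed = compress(payload)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{name}: ratio={len(compressed) / len(payload):.2%}, time={elapsed_ms:.2f} ms")

payload = b"example cached search result " * 2000  # ~58 KB of repetitive data

benchmark_codec("gzip level 9", lambda d: gzip.compress(d, compresslevel=9), payload)
benchmark_codec("zlib level 6", lambda d: zlib.compress(d, 6), payload)
# With the `lz4` package installed:
# import lz4.frame
# benchmark_codec("lz4", lz4.frame.compress, payload)
```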
Compression Configuration
from packages.caching.optimization import CompressionEngine, CompressionAlgorithm
# High compression for storage efficiency
high_compression_config = OptimizationConfig(
compression_enabled=True,
compression_threshold_bytes=512, # Compress items as small as 512 bytes
compression_algorithm=CompressionAlgorithm.GZIP,
compression_level=9 # Maximum compression
)
# Fast compression for performance
fast_compression_config = OptimizationConfig(
compression_enabled=True,
compression_threshold_bytes=2048, # Only compress items larger than 2 KB
compression_algorithm=CompressionAlgorithm.LZ4,
compression_level=1 # Fast compression
)
compression_engine = CompressionEngine(fast_compression_config)
Compression Performance Monitoring
# Monitor compression effectiveness
compression_stats = compression_engine.get_stats()
print(f"Total compressed: {compression_stats['total_compressed']}")
print(f"Total decompressed: {compression_stats['total_decompressed']}")
print(f"Average compression ratio: {compression_stats['compression_ratio']:.2%}")
print(f"Time saved: {compression_stats['time_saved_ms']:.2f} ms")
Semantic Matching Tuning
Similarity Threshold Optimization
from packages.caching.semantic import SemanticMatcher, SimilarityMetric
# Test different thresholds
thresholds = [0.7, 0.75, 0.8, 0.85, 0.9, 0.95]
results = []
for threshold in thresholds:
matcher = SemanticMatcher(
similarity_threshold=threshold,
max_candidates=10
)
# Test with your data
hit_rate = test_semantic_matching(matcher, test_queries)
precision = calculate_precision(matcher, test_queries)
results.append({
'threshold': threshold,
'hit_rate': hit_rate,
'precision': precision,
'f1_score': 2 * (hit_rate * precision) / (hit_rate + precision) if (hit_rate + precision) else 0.0
})
# Find optimal threshold
best_result = max(results, key=lambda x: x['f1_score'])
optimal_threshold = best_result['threshold']
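The helpers used above (`test_semantic_matching`, `calculate_precision`) stand in for your own evaluation harness. A standalone sketch follows: it works directly on embeddings rather than the `SemanticMatcher` API, and it assumes you have ground-truth labels mapping each test query to the cached entry (if any) that should serve it.

```python
import numpy as np

# Sketch: hit rate = fraction of queries whose best cosine match clears the threshold;
# precision = fraction of those hits that matched the correct cached entry.
def evaluate_threshold(query_embs, cached_embs, truth_indices, threshold):
    cached = cached_embs / np.linalg.norm(cached_embs, axis=1, keepdims=True)
    hits = correct = 0
    for q, expected_idx in zip(query_embs, truth_indices):
        q = q / np.linalg.norm(q)
        sims = cached @ q                  # cosine similarity against every cached entry
        best = int(np.argmax(sims))
        if sims[best] >= threshold:        # counts as a semantic cache hit
            hits += 1
            correct += int(best == expected_idx)
    hit_rate = hits / len(query_embs)
    precision = correct / hits if hits else 0.0
    return hit_rate, precision
```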
Similarity Metric Selection
# Test different similarity metrics
metrics = [
SimilarityMetric.COSINE,
SimilarityMetric.EUCLIDEAN,
SimilarityMetric.DOT_PRODUCT,
SimilarityMetric.MANHATTAN
]
for metric in metrics:
matcher = SemanticMatcher(
similarity_threshold=0.85,
max_candidates=10,
metric=metric
)
# Test performance with your data
performance = test_similarity_metric(matcher, test_embeddings)
print(f"{metric.value}: {performance}")
Candidate Selection Tuning
# Optimize number of candidates
candidate_counts = [3, 5, 10, 15, 20]
results = []
for max_candidates in candidate_counts:
matcher = SemanticMatcher(
similarity_threshold=0.85,
max_candidates=max_candidates
)
# Test with your data
hit_rate = test_candidate_selection(matcher, test_queries)
response_time = measure_response_time(matcher, test_queries)
results.append({
'max_candidates': max_candidates,
'hit_rate': hit_rate,
'response_time': response_time,
'efficiency': hit_rate / response_time # Hit rate per ms
})
# Find optimal candidate count
best_result = max(results, key=lambda x: x['efficiency'])
optimal_candidates = best_result['max_candidates']
Cache Warming Optimization
Warming Strategy Tuning
from packages.caching.warming import WarmingStrategy, PredictiveWarmer
# Test different warming strategies
strategies = [
WarmingStrategy(
enabled=True,
batch_size=50,
interval_seconds=1800, # 30 minutes
priority_threshold=0.8
),
WarmingStrategy(
enabled=True,
batch_size=100,
interval_seconds=900, # 15 minutes
priority_threshold=0.7
),
WarmingStrategy(
enabled=True,
batch_size=200,
interval_seconds=3600, # 1 hour
priority_threshold=0.6
)
]
for strategy in strategies:
warmer = PredictiveWarmer(config)
warmer.add_warming_strategy("test", strategy)
# Test warming effectiveness
effectiveness = test_warming_strategy(warmer, test_queries)
print(f"Strategy {strategy.batch_size}/{strategy.interval_seconds}: {effectiveness}")
Pattern Analysis Optimization
from packages.caching.warming import QueryPatternAnalyzer
# Optimize pattern analysis window
windows = [6, 12, 24, 48, 72] # hours
results = []
for window in windows:
analyzer = QueryPatternAnalyzer(config)
# Analyze patterns with different windows
patterns = analyzer.analyze_patterns(
time_window_hours=window,
min_frequency=2
)
# Test pattern quality
quality = evaluate_pattern_quality(patterns)
results.append({
'window': window,
'quality': quality,
'pattern_count': len(patterns.get('frequent_queries', []))
})
# Find optimal window
best_result = max(results, key=lambda x: x['quality'])
optimal_window = best_result['window']
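The `evaluate_pattern_quality` helper above is a placeholder for your own scoring. One possible definition, sketched below, measures how well the patterns from the analysis window predict a held-out set of next-period queries (the loop above would then pass that set in as a second argument).

```python
# Sketch: quality = fraction of held-out queries already covered by the frequent patterns.
def evaluate_pattern_quality(patterns, next_period_queries):
    frequent = set(patterns.get('frequent_queries', []))
    if not next_period_queries:
        return 0.0
    covered = sum(1 for query in next_period_queries if query in frequent)
    return covered / len(next_period_queries)
```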