LLM Provider Architecture
Understanding the multi-provider LLM architecture in RecoAgent
Overview
The LLM Provider Architecture in RecoAgent provides a unified interface for multiple Large Language Model providers with intelligent routing, automatic fallback, and cost optimization. This architecture ensures high availability, cost efficiency, and optimal performance across different LLM providers.
Architecture Components
Core Components
1. Provider Factory
The central orchestrator that manages all LLM providers.
Responsibilities:
- Initialize and manage multiple providers
- Route requests based on strategy
- Handle fallback scenarios
- Monitor provider health
- Track costs and performance
Key Classes:
- ProviderFactory: Main orchestrator
- MultiLLMConfig: Configuration management
- ProviderConfig: Individual provider settings
2. Routing Engine
Intelligent routing system that selects the best provider for each request.
Routing Strategies:
- Cost-based: Select cheapest available provider
- Latency-based: Select fastest provider
- Quality-based: Select highest quality provider
- Manual: Use specified provider
3. Fallback Manager
Automatic fallback system for high availability.
Fallback Scenarios:
- Provider API failures
- Rate limit exceeded
- Timeout errors
- Cost limit exceeded
4. Health Monitor
Continuous monitoring of provider health and performance.
Monitoring Metrics:
- Response times
- Success rates
- Error rates
- Cost per request
- Quality scores
Provider Support
Supported Providers
| Provider | Models | Cost (per 1K tokens) | Typical Latency | Quality |
|---|---|---|---|---|
| OpenAI | GPT-4, GPT-3.5 | $0.01-0.03 | 1.5s | High |
| Anthropic | Claude-3 Opus, Sonnet | $0.015-0.075 | 2.0s | Very High |
| Google | Gemini Pro, Ultra | $0.0005-0.002 | 1.0s | High |
| Groq | Llama-3.3, Mixtral | Free-$0.001 | 0.5s | Good |
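The per-1K-token rates in the table translate directly into per-request costs. A minimal sketch (the rates below are the table's illustrative figures, not live pricing, and `estimate_cost` is a hypothetical helper, not part of the RecoAgent API):

```python
# Illustrative per-1K-token rates taken from the table above.
RATE_PER_1K = {"openai": 0.01, "anthropic": 0.015, "google": 0.0005, "groq": 0.0}

def estimate_cost(provider: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Cost = (prompt + completion tokens) / 1000 * per-1K rate."""
    total_tokens = prompt_tokens + completion_tokens
    return total_tokens / 1000 * RATE_PER_1K[provider]

print(estimate_cost("openai", 800, 200))  # 1000 tokens at $0.01/1K -> 0.01
```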
Provider Configuration
from packages.llm import MultiLLMConfig, AnthropicConfig, GoogleConfig
# Complete configuration
config = MultiLLMConfig(
primary_provider="openai",
routing_strategy="cost",
enable_fallback=True,
fallback_providers=["anthropic", "google", "groq"],
# Provider-specific configs
anthropic=AnthropicConfig(
api_key="your-anthropic-key",
model="claude-3-opus-20240229",
temperature=0.1,
max_tokens=2000
),
google=GoogleConfig(
api_key="your-google-key",
model="gemini-pro",
temperature=0.1,
max_tokens=2000
),
# Cost limits
cost_limit_per_provider={
"openai": 0.10,
"anthropic": 0.12,
"google": 0.08,
"groq": 0.05
},
# Timeout and retry
timeout_seconds=30.0,
max_retries=3
)
Routing Strategies
1. Cost-Based Routing
Selects the cheapest available provider for cost optimization.
# Cost-based routing
config = MultiLLMConfig(
routing_strategy="cost",
cost_limit_per_provider={
"groq": 0.0, # Free tier
"google": 0.0005, # Cheapest paid
"openai": 0.01, # Medium cost
"anthropic": 0.015 # Highest cost
}
)
factory = ProviderFactory(config)
llm = factory.get_provider() # Will select Groq (free) if available
Use Cases:
- High-volume applications
- Cost-sensitive deployments
- Development and testing
2. Latency-Based Routing
Selects the fastest provider for low-latency requirements.
# Latency-based routing
config = MultiLLMConfig(
routing_strategy="latency"
)
factory = ProviderFactory(config)
llm = factory.get_provider() # Will select Groq (fastest) if available
Use Cases:
- Real-time applications
- Interactive chatbots
- User-facing applications
3. Quality-Based Routing
Selects the highest quality provider for best results.
# Quality-based routing
config = MultiLLMConfig(
routing_strategy="quality"
)
factory = ProviderFactory(config)
llm = factory.get_provider() # Will select Anthropic (highest quality)
Use Cases:
- Critical business applications
- Content generation
- Complex reasoning tasks
4. Manual Routing
Uses a specified provider for predictable behavior.
# Manual routing
config = MultiLLMConfig(
routing_strategy="manual",
primary_provider="anthropic"
)
factory = ProviderFactory(config)
llm = factory.get_provider() # Will always use Anthropic
Use Cases:
- Specific model requirements
- Compliance requirements
- A/B testing
Fallback System
Automatic Fallback
The system automatically falls back to alternative providers when the primary provider fails.
# Fallback configuration
config = MultiLLMConfig(
primary_provider="openai",
enable_fallback=True,
fallback_providers=["anthropic", "google", "groq"],
max_retries=3
)
factory = ProviderFactory(config)
# This will try OpenAI first, then fallback to others if needed
llm = factory.get_provider()
response = llm.invoke("What is RAG?")
Fallback Scenarios
API Failures:
- Network timeouts
- Service unavailable
- Authentication errors
- Rate limit exceeded
Cost Limits:
- Daily cost exceeded
- Per-request cost exceeded
- Budget constraints
Quality Issues:
- Response quality below threshold
- Inappropriate content
- Hallucination detection
Fallback Chain
With the configuration above, providers are tried in a fixed order: the primary provider first (openai), then each entry in fallback_providers (anthropic, then google, then groq) until one succeeds.
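The fallback behavior can be sketched as a simple loop over the configured order. This is a minimal illustration, not the actual RecoAgent implementation; `call_provider` and `fake_call` are hypothetical stand-ins for real provider calls:

```python
# Try the primary provider first, then each fallback in order,
# returning the first successful response.
def invoke_with_fallback(chain, call_provider, prompt):
    errors = {}
    for provider in chain:
        try:
            return provider, call_provider(provider, prompt)
        except Exception as exc:  # timeout, rate limit, auth error, ...
            errors[provider] = exc
    raise RuntimeError(f"All providers failed: {errors}")

# Stubbed provider call simulating a primary-provider outage (hypothetical):
def fake_call(provider, prompt):
    if provider == "openai":
        raise TimeoutError("simulated outage")
    return f"{provider} answered: {prompt!r}"

used, answer = invoke_with_fallback(
    ["openai", "anthropic", "google", "groq"], fake_call, "What is RAG?")
print(used)  # anthropic
```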
Cost Management
Cost Tracking
Real-time cost tracking across all providers.
# Get cost information
factory = ProviderFactory(config)
# Check provider costs
for provider in factory.list_providers():
info = factory.get_provider_info(provider)
print(f"{provider}: ${info['cost_per_1k']:.4f} per 1K tokens")
# Track request costs
llm = factory.get_provider("openai")
response = llm.invoke("What is machine learning?")
# Cost is automatically tracked
cost = factory.get_request_cost("openai", response.usage_metadata)
print(f"Request cost: ${cost:.4f}")
Cost Limits
Set cost limits to prevent budget overruns.
# Cost limits configuration
config = MultiLLMConfig(
cost_limit_per_provider={
"openai": 0.10, # $0.10 per request max
"anthropic": 0.12, # $0.12 per request max
"google": 0.08, # $0.08 per request max
"groq": 0.05 # $0.05 per request max
}
)
# Daily cost limits
daily_limits = {
"openai": 50.0, # $50 per day
"anthropic": 60.0, # $60 per day
"google": 40.0, # $40 per day
"groq": 20.0 # $20 per day
}
Cost Optimization
Automatic cost optimization strategies.
# Cost optimization
class CostOptimizer:
def __init__(self, factory: ProviderFactory):
self.factory = factory
self.usage_stats = {}
def optimize_for_cost(self, query: str) -> str:
"""Select provider based on query complexity and cost."""
# Simple queries -> cheaper providers
if len(query.split()) < 10:
return self.factory.get_cheapest_provider()
# Complex queries -> quality providers
else:
return self.factory.get_best_quality_provider()
def track_usage(self, provider: str, cost: float):
"""Track usage for cost optimization."""
if provider not in self.usage_stats:
self.usage_stats[provider] = {"total_cost": 0, "requests": 0}
self.usage_stats[provider]["total_cost"] += cost
self.usage_stats[provider]["requests"] += 1
Health Monitoring
Provider Health Checks
Continuous monitoring of provider health and performance.
# Health monitoring
factory = ProviderFactory(config)
# Check health of all providers
health_status = factory.health_check()
for provider, status in health_status.items():
if status["status"] == "healthy":
print(f"✅ {provider}: {status['model']}")
else:
print(f"❌ {provider}: {status['error']}")
# Get detailed metrics
metrics = factory.get_provider_metrics()
print(f"Average latency: {metrics['avg_latency']:.2f}s")
print(f"Success rate: {metrics['success_rate']:.2%}")
print(f"Total cost: ${metrics['total_cost']:.2f}")
Performance Metrics
Track key performance indicators across providers.
# Performance metrics
from typing import Any, Dict

class ProviderMetrics:
def __init__(self):
self.metrics = {
"response_times": {},
"success_rates": {},
"error_rates": {},
"costs": {},
"quality_scores": {}
}
def record_request(self, provider: str, response_time: float,
success: bool, cost: float, quality: float = None):
"""Record request metrics."""
# Response time
if provider not in self.metrics["response_times"]:
self.metrics["response_times"][provider] = []
self.metrics["response_times"][provider].append(response_time)
# Success rate
if provider not in self.metrics["success_rates"]:
self.metrics["success_rates"][provider] = {"success": 0, "total": 0}
self.metrics["success_rates"][provider]["total"] += 1
if success:
self.metrics["success_rates"][provider]["success"] += 1
# Cost
if provider not in self.metrics["costs"]:
self.metrics["costs"][provider] = 0
self.metrics["costs"][provider] += cost
# Quality
if quality is not None:
if provider not in self.metrics["quality_scores"]:
self.metrics["quality_scores"][provider] = []
self.metrics["quality_scores"][provider].append(quality)
def get_provider_summary(self, provider: str) -> Dict[str, Any]:
"""Get summary metrics for a provider."""
if provider not in self.metrics["response_times"]:
return {"error": "No data available"}
response_times = self.metrics["response_times"][provider]
success_data = self.metrics["success_rates"][provider]
return {
"avg_response_time": sum(response_times) / len(response_times),
"success_rate": success_data["success"] / success_data["total"],
"total_cost": self.metrics["costs"].get(provider, 0),
"total_requests": success_data["total"]
}
Integration Patterns
1. RAG System Integration
from packages.llm import ProviderFactory, MultiLLMConfig
from packages.rag import RAGAgent
# Configure multi-provider LLM
config = MultiLLMConfig(
routing_strategy="quality", # Use best quality for RAG
enable_fallback=True
)
factory = ProviderFactory(config)
# Create RAG agent with multi-provider LLM
rag_agent = RAGAgent(
llm_factory=factory,
vector_store=vector_store,
retriever=retriever
)
# RAG will automatically use the best available provider
response = await rag_agent.process_query("What is machine learning?")
2. LangGraph Agent Integration
from packages.llm import ProviderFactory
from packages.agents import RAGAgentGraph
# Multi-provider agent
factory = ProviderFactory(config)
# Create agent with provider factory
agent = RAGAgentGraph(
llm_factory=factory,
tools=tool_registry
)
# Agent will route to appropriate provider based on task
result = await agent.run(
query="Analyze this document and provide insights",
context={"document": "..."}
)
3. Chatbot Integration
from packages.llm import ProviderFactory
from packages.conversational import DialogueManager
# Multi-provider chatbot
factory = ProviderFactory(config)
class MultiProviderChatbot:
def __init__(self):
self.dialogue_manager = DialogueManager()
self.llm_factory = factory
async def process_message(self, message: str, context: dict):
# Select provider based on message complexity
if self._is_simple_query(message):
llm = self.llm_factory.get_cheapest_provider()
else:
llm = self.llm_factory.get_best_quality_provider()
# Process with selected provider
response = await llm.ainvoke(message)
return response
def _is_simple_query(self, message: str) -> bool:
"""Determine if query is simple enough for cheaper provider."""
return len(message.split()) < 20 and "?" in message
Advanced Features
1. Dynamic Provider Selection
class DynamicProviderSelector:
def __init__(self, factory: ProviderFactory):
self.factory = factory
self.performance_history = {}
def select_provider(self, query: str, context: dict) -> str:
"""Dynamically select provider based on query and context."""
# Analyze query characteristics
query_complexity = self._analyze_complexity(query)
context_size = len(str(context))
# Select based on characteristics
if query_complexity == "simple" and context_size < 1000:
return self.factory.get_cheapest_provider()
elif query_complexity == "complex" or context_size > 5000:
return self.factory.get_best_quality_provider()
else:
return self.factory.get_fastest_provider()
def _analyze_complexity(self, query: str) -> str:
"""Analyze query complexity."""
word_count = len(query.split())
has_technical_terms = any(term in query.lower() for term in
["analyze", "compare", "explain", "evaluate"])
if word_count < 10 and not has_technical_terms:
return "simple"
elif word_count > 50 or has_technical_terms:
return "complex"
else:
return "medium"
2. A/B Testing
from typing import Any, Dict

class ProviderABTesting:
def __init__(self, factory: ProviderFactory):
self.factory = factory
self.test_groups = {}
self.results = {}
    def assign_to_group(self, user_id: str) -> str:
        """Assign user to an A/B test group."""
        if user_id not in self.test_groups:
            # Deterministic assignment: Python's built-in hash() is salted
            # per process for strings, so use a stable digest instead
            import hashlib
            digest = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
            self.test_groups[user_id] = "A" if digest % 2 == 0 else "B"
        return self.test_groups[user_id]
    def get_provider_for_user(self, user_id: str):
        """Get the LLM provider for the user's test group."""
group = self.assign_to_group(user_id)
if group == "A":
return self.factory.get_provider("openai")
else:
return self.factory.get_provider("anthropic")
def record_result(self, user_id: str, provider: str,
response_time: float, quality_score: float):
"""Record A/B test result."""
group = self.test_groups[user_id]
if group not in self.results:
self.results[group] = {
"response_times": [],
"quality_scores": [],
"provider": provider
}
self.results[group]["response_times"].append(response_time)
self.results[group]["quality_scores"].append(quality_score)
def get_test_results(self) -> Dict[str, Any]:
"""Get A/B test results."""
results = {}
for group, data in self.results.items():
results[group] = {
"avg_response_time": sum(data["response_times"]) / len(data["response_times"]),
"avg_quality_score": sum(data["quality_scores"]) / len(data["quality_scores"]),
"provider": data["provider"],
"sample_size": len(data["response_times"])
}
return results
3. Load Balancing
class ProviderLoadBalancer:
def __init__(self, factory: ProviderFactory):
self.factory = factory
self.active_requests = {}
self.request_queues = {}
def get_provider_with_load_balancing(self) -> str:
"""Get provider with load balancing."""
available_providers = self.factory.list_providers()
# Calculate load for each provider
provider_loads = {}
for provider in available_providers:
active = self.active_requests.get(provider, 0)
queued = len(self.request_queues.get(provider, []))
provider_loads[provider] = active + queued
# Select provider with lowest load
if provider_loads:
return min(provider_loads, key=provider_loads.get)
else:
return available_providers[0]
def start_request(self, provider: str, request_id: str):
"""Track request start."""
if provider not in self.active_requests:
self.active_requests[provider] = 0
self.active_requests[provider] += 1
def end_request(self, provider: str, request_id: str):
"""Track request end."""
if provider in self.active_requests:
self.active_requests[provider] = max(0, self.active_requests[provider] - 1)
Best Practices
1. Provider Selection
- Use cost-based routing for high-volume applications
- Use quality-based routing for critical business logic
- Use latency-based routing for real-time applications
- Implement fallback for high availability
2. Cost Management
- Set cost limits to prevent budget overruns
- Monitor usage across all providers
- Optimize for cost when quality requirements are met
- Use free tiers for development and testing
3. Performance Optimization
- Cache responses for repeated queries
- Batch requests when possible
- Monitor latency and optimize routing
- Use appropriate models for task complexity
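Caching responses for repeated queries, as recommended above, can be as simple as memoizing on the (provider, prompt) pair. A minimal sketch (a production cache would also key on model and set a TTL; `expensive_llm_call` is a hypothetical stand-in for a real provider call):

```python
from functools import lru_cache

calls = {"n": 0}

def expensive_llm_call(provider, prompt):
    """Stand-in for a real provider call (hypothetical)."""
    calls["n"] += 1
    return f"{provider}: answer to {prompt!r}"

@lru_cache(maxsize=1024)
def cached_answer(provider: str, prompt: str) -> str:
    return expensive_llm_call(provider, prompt)

cached_answer("openai", "What is RAG?")
cached_answer("openai", "What is RAG?")  # served from cache
print(calls["n"])  # 1
```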
4. Error Handling
- Implement retry logic with exponential backoff
- Handle rate limits gracefully
- Monitor provider health continuously
- Provide fallback responses when all providers fail
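The retry-with-exponential-backoff recommendation can be sketched as follows (delays and retry counts are illustrative defaults, not RecoAgent's actual settings; jitter is added to avoid synchronized retries against a rate-limited provider):

```python
import random
import time

def retry_with_backoff(fn, max_retries=3, base_delay=0.5, max_delay=8.0):
    """Call fn(), retrying on any exception with exponentially growing,
    jittered delays; re-raise once max_retries is exhausted."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay + random.uniform(0, delay * 0.1))
```

In practice the wrapped `fn` would be a single provider call, so a failed call backs off before retrying the same provider, and only then falls through to the fallback chain.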
Monitoring and Alerting
Key Metrics to Monitor
Performance Metrics:
- Response time per provider
- Success rate per provider
- Error rate per provider
- Throughput per provider
Cost Metrics:
- Cost per request per provider
- Daily/monthly costs per provider
- Cost per successful request
- Budget utilization
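The cost metrics above reduce to simple ratios over raw counters. A worked example (all numbers are made-up illustrative values, not real provider prices):

```python
# Example counters for one provider on one day (illustrative).
total_cost = 12.40   # dollars spent
successes = 310
failures = 14
daily_budget = 50.0

cost_per_request = total_cost / (successes + failures)
cost_per_successful_request = total_cost / successes
budget_utilization = total_cost / daily_budget

print(round(cost_per_successful_request, 4))  # 0.04
print(round(budget_utilization, 3))           # 0.248
```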
Quality Metrics:
- Response quality scores
- User satisfaction ratings
- Task completion rates
- Error types and frequencies
Alerting Rules
# Example alerting rules
alerting_rules = {
"high_error_rate": {
"condition": "error_rate > 0.05",
"duration": "5m",
"severity": "warning"
},
"cost_limit_exceeded": {
"condition": "daily_cost > daily_limit * 0.8",
"duration": "1m",
"severity": "critical"
},
"provider_down": {
"condition": "success_rate == 0",
"duration": "2m",
"severity": "critical"
},
"high_latency": {
"condition": "avg_response_time > 10s",
"duration": "5m",
"severity": "warning"
}
}
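Evaluating rules like these means checking each condition against current metrics. A minimal sketch; here the condition strings above are mirrored as hand-written predicates rather than parsed (an assumption for illustration), and the metric values are made-up examples:

```python
# Current metrics for one provider (illustrative values).
metrics = {"error_rate": 0.08, "daily_cost": 45.0, "daily_limit": 50.0,
           "success_rate": 0.97, "avg_response_time": 3.2}

# Predicates mirroring the alerting_rules conditions above.
predicates = {
    "high_error_rate": lambda m: m["error_rate"] > 0.05,
    "cost_limit_exceeded": lambda m: m["daily_cost"] > m["daily_limit"] * 0.8,
    "provider_down": lambda m: m["success_rate"] == 0,
    "high_latency": lambda m: m["avg_response_time"] > 10,
}

firing = [name for name, pred in predicates.items() if pred(metrics)]
print(firing)  # ['high_error_rate', 'cost_limit_exceeded']
```

A real deployment would also apply each rule's `duration` window (the condition must hold continuously for that long) before raising the alert.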
Future Enhancements
Planned Features
- Model-specific routing: Route to specific models based on task type
- Quality prediction: Predict response quality before sending request
- Cost prediction: Predict request cost before sending
- Auto-scaling: Automatically adjust provider usage based on load
- Custom providers: Support for custom/private LLM providers
Research Areas
- Adaptive routing: Learn optimal routing from usage patterns
- Quality assessment: Real-time quality scoring
- Cost optimization: Advanced cost optimization algorithms
- Performance prediction: Predict performance before request
- Multi-modal support: Support for vision and audio models
This architecture provides a robust, scalable, and cost-effective foundation for using multiple LLM providers in production applications.