LLM Provider Architecture

Understanding the multi-provider LLM architecture in RecoAgent


Overview

The LLM Provider Architecture in RecoAgent provides a unified interface for multiple Large Language Model providers with intelligent routing, automatic fallback, and cost optimization. This architecture ensures high availability, cost efficiency, and optimal performance across different LLM providers.

Architecture Components

Core Components

1. Provider Factory

The central orchestrator that manages all LLM providers.

Responsibilities:

  • Initialize and manage multiple providers
  • Route requests based on strategy
  • Handle fallback scenarios
  • Monitor provider health
  • Track costs and performance

Key Classes:

  • ProviderFactory: Main orchestrator
  • MultiLLMConfig: Configuration management
  • ProviderConfig: Individual provider settings

2. Routing Engine

Intelligent routing system that selects the best provider for each request.

Routing Strategies:

  • Cost-based: Select cheapest available provider
  • Latency-based: Select fastest provider
  • Quality-based: Select highest quality provider
  • Manual: Use specified provider
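Each of the four strategies reduces to a simple selection over per-provider metadata. A minimal sketch of that dispatch, assuming an illustrative metadata table (the `PROVIDERS` dict and `select_provider` helper below are hypothetical, not RecoAgent's actual API):

```python
# Hypothetical routing dispatch; provider metadata mirrors the table in
# the "Supported Providers" section but is illustrative only.
PROVIDERS = {
    "openai":    {"cost_per_1k": 0.01,   "latency_s": 1.5, "quality": 3},
    "anthropic": {"cost_per_1k": 0.015,  "latency_s": 2.0, "quality": 4},
    "google":    {"cost_per_1k": 0.0005, "latency_s": 1.0, "quality": 3},
    "groq":      {"cost_per_1k": 0.0,    "latency_s": 0.5, "quality": 2},
}

def select_provider(strategy: str, manual_choice: str = "openai") -> str:
    """Pick a provider name according to the routing strategy."""
    if strategy == "cost":
        return min(PROVIDERS, key=lambda p: PROVIDERS[p]["cost_per_1k"])
    if strategy == "latency":
        return min(PROVIDERS, key=lambda p: PROVIDERS[p]["latency_s"])
    if strategy == "quality":
        return max(PROVIDERS, key=lambda p: PROVIDERS[p]["quality"])
    return manual_choice  # "manual" strategy
```

With these numbers, cost and latency routing both land on Groq, while quality routing lands on Anthropic.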

3. Fallback Manager

Automatic fallback system for high availability.

Fallback Scenarios:

  • Provider API failures
  • Rate limit exceeded
  • Timeout errors
  • Cost limit exceeded

4. Health Monitor

Continuous monitoring of provider health and performance.

Monitoring Metrics:

  • Response times
  • Success rates
  • Error rates
  • Cost per request
  • Quality scores

Provider Support

Supported Providers

| Provider  | Models                | Cost (per 1K tokens) | Typical Latency | Quality   |
|-----------|-----------------------|----------------------|-----------------|-----------|
| OpenAI    | GPT-4, GPT-3.5        | $0.01-0.03           | 1.5s            | High      |
| Anthropic | Claude-3 Opus, Sonnet | $0.015-0.075         | 2.0s            | Very High |
| Google    | Gemini Pro, Ultra     | $0.0005-0.002        | 1.0s            | High      |
| Groq      | Llama-3.3, Mixtral    | Free-$0.001          | 0.5s            | Good      |

Provider Configuration

from packages.llm import (
    MultiLLMConfig,
    AnthropicConfig,
    GoogleConfig,
    ProviderFactory,
)

# Complete configuration
config = MultiLLMConfig(
    primary_provider="openai",
    routing_strategy="cost",
    enable_fallback=True,
    fallback_providers=["anthropic", "google", "groq"],

    # Provider-specific configs
    anthropic=AnthropicConfig(
        api_key="your-anthropic-key",
        model="claude-3-opus-20240229",
        temperature=0.1,
        max_tokens=2000,
    ),
    google=GoogleConfig(
        api_key="your-google-key",
        model="gemini-pro",
        temperature=0.1,
        max_tokens=2000,
    ),

    # Cost limits (USD per request)
    cost_limit_per_provider={
        "openai": 0.10,
        "anthropic": 0.12,
        "google": 0.08,
        "groq": 0.05,
    },

    # Timeout and retry
    timeout_seconds=30.0,
    max_retries=3,
)

Routing Strategies

1. Cost-Based Routing

Selects the cheapest available provider for cost optimization.

# Cost-based routing
config = MultiLLMConfig(
    routing_strategy="cost",
    cost_limit_per_provider={
        "groq": 0.0,         # Free tier
        "google": 0.0005,    # Cheapest paid
        "openai": 0.01,      # Medium cost
        "anthropic": 0.015,  # Highest cost
    },
)

factory = ProviderFactory(config)
llm = factory.get_provider()  # Will select Groq (free) if available

Use Cases:

  • High-volume applications
  • Cost-sensitive deployments
  • Development and testing

2. Latency-Based Routing

Selects the fastest provider for low-latency requirements.

# Latency-based routing
config = MultiLLMConfig(
    routing_strategy="latency",
)

factory = ProviderFactory(config)
llm = factory.get_provider()  # Will select Groq (fastest) if available

Use Cases:

  • Real-time applications
  • Interactive chatbots
  • User-facing applications

3. Quality-Based Routing

Selects the highest quality provider for best results.

# Quality-based routing
config = MultiLLMConfig(
    routing_strategy="quality",
)

factory = ProviderFactory(config)
llm = factory.get_provider()  # Will select Anthropic (highest quality)

Use Cases:

  • Critical business applications
  • Content generation
  • Complex reasoning tasks

4. Manual Routing

Uses a specified provider for predictable behavior.

# Manual routing
config = MultiLLMConfig(
    routing_strategy="manual",
    primary_provider="anthropic",
)

factory = ProviderFactory(config)
llm = factory.get_provider()  # Will always use Anthropic

Use Cases:

  • Specific model requirements
  • Compliance requirements
  • A/B testing

Fallback System

Automatic Fallback

The system automatically falls back to alternative providers when the primary provider fails.

# Fallback configuration
config = MultiLLMConfig(
    primary_provider="openai",
    enable_fallback=True,
    fallback_providers=["anthropic", "google", "groq"],
    max_retries=3,
)

factory = ProviderFactory(config)

# This will try OpenAI first, then fall back to the others if needed
llm = factory.get_provider()
response = llm.invoke("What is RAG?")

Fallback Scenarios

API Failures:

  • Network timeouts
  • Service unavailable
  • Authentication errors
  • Rate limit exceeded

Cost Limits:

  • Daily cost exceeded
  • Per-request cost exceeded
  • Budget constraints

Quality Issues:

  • Response quality below threshold
  • Inappropriate content
  • Hallucination detection

Fallback Chain
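Conceptually, the chain walks the primary provider and each fallback in order until one call succeeds. A minimal sketch of that loop, assuming a hypothetical `call_fn` hook that stands in for a real provider invocation (this is illustrative, not RecoAgent's internal implementation):

```python
# Illustrative fallback chain: try each provider in order, collect errors,
# and raise only when every provider has failed.
def invoke_with_fallback(chain, call_fn, prompt):
    """Return (provider_name, response) from the first provider that succeeds."""
    errors = {}
    for provider in chain:
        try:
            return provider, call_fn(provider, prompt)
        except Exception as exc:  # API failure, rate limit, timeout, ...
            errors[provider] = str(exc)
    raise RuntimeError(f"All providers failed: {errors}")
```

For example, if the `openai` call raises a timeout, the loop simply moves on and returns the `anthropic` response instead.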

Cost Management

Cost Tracking

Real-time cost tracking across all providers.

# Get cost information
factory = ProviderFactory(config)

# Check provider costs
for provider in factory.list_providers():
    info = factory.get_provider_info(provider)
    print(f"{provider}: ${info['cost_per_1k']:.4f} per 1K tokens")

# Track request costs
llm = factory.get_provider("openai")
response = llm.invoke("What is machine learning?")

# Cost is automatically tracked
cost = factory.get_request_cost("openai", response.usage_metadata)
print(f"Request cost: ${cost:.4f}")

Cost Limits

Set cost limits to prevent budget overruns.

# Cost limits configuration
config = MultiLLMConfig(
    cost_limit_per_provider={
        "openai": 0.10,     # $0.10 per request max
        "anthropic": 0.12,  # $0.12 per request max
        "google": 0.08,     # $0.08 per request max
        "groq": 0.05,       # $0.05 per request max
    },
)

# Daily cost limits
daily_limits = {
    "openai": 50.0,     # $50 per day
    "anthropic": 60.0,  # $60 per day
    "google": 40.0,     # $40 per day
    "groq": 20.0,       # $20 per day
}
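Enforcing daily limits like these amounts to keeping a running total per provider and refusing calls that would push it over budget. A hedged sketch (the `DailyBudget` class is hypothetical; RecoAgent may enforce limits differently):

```python
# Hypothetical daily-budget guard: tracks spend per provider and blocks
# requests that would exceed the configured daily limit.
class DailyBudget:
    def __init__(self, limits: dict):
        self.limits = limits                       # provider -> max USD/day
        self.spent = {p: 0.0 for p in limits}      # provider -> USD spent today

    def can_spend(self, provider: str, cost: float) -> bool:
        """True if `cost` fits within the provider's remaining daily budget."""
        return self.spent[provider] + cost <= self.limits[provider]

    def record(self, provider: str, cost: float) -> None:
        """Record spend, raising if the daily limit would be exceeded."""
        if not self.can_spend(provider, cost):
            raise RuntimeError(f"{provider} daily budget exceeded")
        self.spent[provider] += cost
```

In practice the `spent` counters would be reset on a daily schedule and persisted across processes.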

Cost Optimization

Automatic cost optimization strategies.

# Cost optimization
class CostOptimizer:
    def __init__(self, factory: ProviderFactory):
        self.factory = factory
        self.usage_stats = {}

    def optimize_for_cost(self, query: str) -> str:
        """Select provider based on query complexity and cost."""
        # Simple queries -> cheaper providers
        if len(query.split()) < 10:
            return self.factory.get_cheapest_provider()
        # Complex queries -> quality providers
        else:
            return self.factory.get_best_quality_provider()

    def track_usage(self, provider: str, cost: float):
        """Track usage for cost optimization."""
        if provider not in self.usage_stats:
            self.usage_stats[provider] = {"total_cost": 0.0, "requests": 0}
        self.usage_stats[provider]["total_cost"] += cost
        self.usage_stats[provider]["requests"] += 1

Health Monitoring

Provider Health Checks

Continuous monitoring of provider health and performance.

# Health monitoring
factory = ProviderFactory(config)

# Check health of all providers
health_status = factory.health_check()

for provider, status in health_status.items():
    if status["status"] == "healthy":
        print(f"✅ {provider}: {status['model']}")
    else:
        print(f"❌ {provider}: {status['error']}")

# Get detailed metrics
metrics = factory.get_provider_metrics()
print(f"Average latency: {metrics['avg_latency']:.2f}s")
print(f"Success rate: {metrics['success_rate']:.2%}")
print(f"Total cost: ${metrics['total_cost']:.2f}")

Performance Metrics

Track key performance indicators across providers.

# Performance metrics
from typing import Any, Dict, Optional


class ProviderMetrics:
    def __init__(self):
        self.metrics = {
            "response_times": {},
            "success_rates": {},
            "error_rates": {},
            "costs": {},
            "quality_scores": {},
        }

    def record_request(self, provider: str, response_time: float,
                       success: bool, cost: float,
                       quality: Optional[float] = None):
        """Record request metrics."""
        # Response time
        if provider not in self.metrics["response_times"]:
            self.metrics["response_times"][provider] = []
        self.metrics["response_times"][provider].append(response_time)

        # Success rate
        if provider not in self.metrics["success_rates"]:
            self.metrics["success_rates"][provider] = {"success": 0, "total": 0}
        self.metrics["success_rates"][provider]["total"] += 1
        if success:
            self.metrics["success_rates"][provider]["success"] += 1

        # Cost
        if provider not in self.metrics["costs"]:
            self.metrics["costs"][provider] = 0.0
        self.metrics["costs"][provider] += cost

        # Quality
        if quality is not None:
            if provider not in self.metrics["quality_scores"]:
                self.metrics["quality_scores"][provider] = []
            self.metrics["quality_scores"][provider].append(quality)

    def get_provider_summary(self, provider: str) -> Dict[str, Any]:
        """Get summary metrics for a provider."""
        if provider not in self.metrics["response_times"]:
            return {"error": "No data available"}

        response_times = self.metrics["response_times"][provider]
        success_data = self.metrics["success_rates"][provider]

        return {
            "avg_response_time": sum(response_times) / len(response_times),
            "success_rate": success_data["success"] / success_data["total"],
            "total_cost": self.metrics["costs"].get(provider, 0.0),
            "total_requests": success_data["total"],
        }

Integration Patterns

1. RAG System Integration

from packages.llm import ProviderFactory, MultiLLMConfig
from packages.rag import RAGAgent

# Configure multi-provider LLM
config = MultiLLMConfig(
    routing_strategy="quality",  # Use best quality for RAG
    enable_fallback=True,
)

factory = ProviderFactory(config)

# Create RAG agent with multi-provider LLM
rag_agent = RAGAgent(
    llm_factory=factory,
    vector_store=vector_store,
    retriever=retriever,
)

# RAG will automatically use the best available provider
response = await rag_agent.process_query("What is machine learning?")

2. LangGraph Agent Integration

from packages.llm import ProviderFactory
from packages.agents import RAGAgentGraph

# Multi-provider agent
factory = ProviderFactory(config)

# Create agent with provider factory
agent = RAGAgentGraph(
    llm_factory=factory,
    tools=tool_registry,
)

# Agent will route to appropriate provider based on task
result = await agent.run(
    query="Analyze this document and provide insights",
    context={"document": "..."},
)

3. Chatbot Integration

from packages.llm import ProviderFactory
from packages.conversational import DialogueManager

# Multi-provider chatbot
factory = ProviderFactory(config)


class MultiProviderChatbot:
    def __init__(self):
        self.dialogue_manager = DialogueManager()
        self.llm_factory = factory

    async def process_message(self, message: str, context: dict):
        # Select provider based on message complexity
        if self._is_simple_query(message):
            llm = self.llm_factory.get_cheapest_provider()
        else:
            llm = self.llm_factory.get_best_quality_provider()

        # Process with selected provider
        response = await llm.ainvoke(message)
        return response

    def _is_simple_query(self, message: str) -> bool:
        """Determine if a query is simple enough for a cheaper provider."""
        return len(message.split()) < 20 and "?" in message

Advanced Features

1. Dynamic Provider Selection

class DynamicProviderSelector:
    def __init__(self, factory: ProviderFactory):
        self.factory = factory
        self.performance_history = {}

    def select_provider(self, query: str, context: dict) -> str:
        """Dynamically select provider based on query and context."""
        # Analyze query characteristics
        query_complexity = self._analyze_complexity(query)
        context_size = len(str(context))

        # Select based on characteristics
        if query_complexity == "simple" and context_size < 1000:
            return self.factory.get_cheapest_provider()
        elif query_complexity == "complex" or context_size > 5000:
            return self.factory.get_best_quality_provider()
        else:
            return self.factory.get_fastest_provider()

    def _analyze_complexity(self, query: str) -> str:
        """Analyze query complexity."""
        word_count = len(query.split())
        has_technical_terms = any(
            term in query.lower()
            for term in ["analyze", "compare", "explain", "evaluate"]
        )

        if word_count < 10 and not has_technical_terms:
            return "simple"
        elif word_count > 50 or has_technical_terms:
            return "complex"
        else:
            return "medium"

2. A/B Testing

from typing import Any, Dict


class ProviderABTesting:
    def __init__(self, factory: ProviderFactory):
        self.factory = factory
        self.test_groups = {}
        self.results = {}

    def assign_to_group(self, user_id: str) -> str:
        """Assign user to an A/B test group."""
        if user_id not in self.test_groups:
            # Deterministic pseudo-random assignment
            self.test_groups[user_id] = "A" if hash(user_id) % 2 == 0 else "B"
        return self.test_groups[user_id]

    def get_provider_for_user(self, user_id: str) -> str:
        """Get provider based on the user's test group."""
        group = self.assign_to_group(user_id)
        if group == "A":
            return self.factory.get_provider("openai")
        else:
            return self.factory.get_provider("anthropic")

    def record_result(self, user_id: str, provider: str,
                      response_time: float, quality_score: float):
        """Record an A/B test result."""
        group = self.test_groups[user_id]
        if group not in self.results:
            self.results[group] = {
                "response_times": [],
                "quality_scores": [],
                "provider": provider,
            }
        self.results[group]["response_times"].append(response_time)
        self.results[group]["quality_scores"].append(quality_score)

    def get_test_results(self) -> Dict[str, Any]:
        """Get A/B test results."""
        results = {}
        for group, data in self.results.items():
            results[group] = {
                "avg_response_time": sum(data["response_times"]) / len(data["response_times"]),
                "avg_quality_score": sum(data["quality_scores"]) / len(data["quality_scores"]),
                "provider": data["provider"],
                "sample_size": len(data["response_times"]),
            }
        return results

3. Load Balancing

class ProviderLoadBalancer:
    def __init__(self, factory: ProviderFactory):
        self.factory = factory
        self.active_requests = {}
        self.request_queues = {}

    def get_provider_with_load_balancing(self) -> str:
        """Get the provider with the lowest current load."""
        available_providers = self.factory.list_providers()

        # Calculate load for each provider
        provider_loads = {}
        for provider in available_providers:
            active = self.active_requests.get(provider, 0)
            queued = len(self.request_queues.get(provider, []))
            provider_loads[provider] = active + queued

        # Select provider with lowest load
        if provider_loads:
            return min(provider_loads, key=provider_loads.get)
        else:
            return available_providers[0]

    def start_request(self, provider: str, request_id: str):
        """Track request start."""
        if provider not in self.active_requests:
            self.active_requests[provider] = 0
        self.active_requests[provider] += 1

    def end_request(self, provider: str, request_id: str):
        """Track request end."""
        if provider in self.active_requests:
            self.active_requests[provider] = max(0, self.active_requests[provider] - 1)

Best Practices

1. Provider Selection

  • Use cost-based routing for high-volume applications
  • Use quality-based routing for critical business logic
  • Use latency-based routing for real-time applications
  • Implement fallback for high availability

2. Cost Management

  • Set cost limits to prevent budget overruns
  • Monitor usage across all providers
  • Optimize for cost when quality requirements are met
  • Use free tiers for development and testing

3. Performance Optimization

  • Cache responses for repeated queries
  • Batch requests when possible
  • Monitor latency and optimize routing
  • Use appropriate models for task complexity
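Response caching, the first item above, can be as simple as memoizing on a hash of the provider and prompt. A hedged sketch (keying on the raw prompt is an assumption; production caches typically normalize the prompt and add TTLs):

```python
# Minimal response cache for repeated queries, keyed on provider + prompt.
import hashlib

class ResponseCache:
    def __init__(self):
        self._store = {}

    def _key(self, provider: str, prompt: str) -> str:
        return hashlib.sha256(f"{provider}:{prompt}".encode()).hexdigest()

    def get_or_call(self, provider: str, prompt: str, call_fn):
        """Return a cached response, or invoke call_fn once and cache it."""
        key = self._key(provider, prompt)
        if key not in self._store:
            self._store[key] = call_fn(provider, prompt)
        return self._store[key]
```

A repeated query then costs zero provider calls after the first hit, which matters most under cost-based routing.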

4. Error Handling

  • Implement retry logic with exponential backoff
  • Handle rate limits gracefully
  • Monitor provider health continuously
  • Provide fallback responses when all providers fail
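The retry-with-exponential-backoff recommendation above can be sketched as follows; the delay schedule and jitter factor are illustrative choices, and the injectable `sleep` parameter exists only to make the sketch testable:

```python
# Retry with exponential backoff and jitter: delays grow as
# base_delay * 2**attempt, with up to 10% random jitter added.
import random
import time

def retry_with_backoff(fn, max_retries=3, base_delay=0.5, sleep=time.sleep):
    """Call fn(), retrying up to max_retries times on any exception."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # Out of retries: surface the last error
            sleep(base_delay * (2 ** attempt) * (1 + random.random() * 0.1))
```

In a real deployment the except clause would catch only transient errors (timeouts, rate limits) and respect any `Retry-After` header the provider returns.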

Monitoring and Alerting

Key Metrics to Monitor

Performance Metrics:

  • Response time per provider
  • Success rate per provider
  • Error rate per provider
  • Throughput per provider

Cost Metrics:

  • Cost per request per provider
  • Daily/monthly costs per provider
  • Cost per successful request
  • Budget utilization

Quality Metrics:

  • Response quality scores
  • User satisfaction ratings
  • Task completion rates
  • Error types and frequencies

Alerting Rules

# Example alerting rules
alerting_rules = {
    "high_error_rate": {
        "condition": "error_rate > 0.05",
        "duration": "5m",
        "severity": "warning",
    },
    "cost_limit_exceeded": {
        "condition": "daily_cost > daily_limit * 0.8",
        "duration": "1m",
        "severity": "critical",
    },
    "provider_down": {
        "condition": "success_rate == 0",
        "duration": "2m",
        "severity": "critical",
    },
    "high_latency": {
        "condition": "avg_response_time > 10s",
        "duration": "5m",
        "severity": "warning",
    },
}
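Rules like these would normally be handed to a monitoring stack (a Prometheus-style alert manager, for example) rather than evaluated in-process. Purely to illustrate the semantics, here is a toy evaluator that uses callable predicates instead of the string conditions above (the predicate-based form is a simplifying assumption):

```python
# Toy alert evaluator: each rule carries a predicate over a metrics dict;
# a rule "fires" when its predicate returns True.
def evaluate_rules(rules, metrics):
    """Return (rule_name, severity) pairs for every rule that fires."""
    fired = []
    for name, rule in rules.items():
        if rule["predicate"](metrics):
            fired.append((name, rule["severity"]))
    return fired

rules = {
    "high_error_rate": {
        "predicate": lambda m: m["error_rate"] > 0.05,
        "severity": "warning",
    },
    "provider_down": {
        "predicate": lambda m: m["success_rate"] == 0,
        "severity": "critical",
    },
}
```

A real evaluator would also track how long each condition has held, so the `duration` field suppresses alerts on transient spikes.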

Future Enhancements

Planned Features

  • Model-specific routing: Route to specific models based on task type
  • Quality prediction: Predict response quality before sending request
  • Cost prediction: Predict request cost before sending
  • Auto-scaling: Automatically adjust provider usage based on load
  • Custom providers: Support for custom/private LLM providers

Research Areas

  • Adaptive routing: Learn optimal routing from usage patterns
  • Quality assessment: Real-time quality scoring
  • Cost optimization: Advanced cost optimization algorithms
  • Performance prediction: Predict performance before request
  • Multi-modal support: Support for vision and audio models

This architecture provides a robust, scalable, and cost-effective foundation for using multiple LLM providers in production applications.