Bandit Algorithms: Exploration vs Exploitation
This document explains the fundamental concepts behind bandit algorithms and how they solve the exploration-exploitation trade-off in recommendation systems.
The Multi-Armed Bandit Problem
The multi-armed bandit problem is a classic problem in probability theory and machine learning. Imagine you're in a casino with multiple slot machines (arms), each with an unknown probability of paying out. Your goal is to maximize your total reward over time by deciding which machines to play.
The Core Challenge
The fundamental challenge is the exploration-exploitation trade-off:
- Exploitation: Play the machine that has given you the best results so far
- Exploration: Try other machines to discover if they might be better
Too much exploitation means you might miss better opportunities. Too much exploration means you're not capitalizing on what you've already learned.
Bandit Algorithms in Recommendation Systems
In recommendation systems, bandit algorithms help solve similar problems:
- Which recommendation algorithm should I use for this user?
- Which product should I recommend to maximize conversion?
- Which ad should I show to maximize click-through rate?
Real-World Example
Consider an e-commerce site testing three recommendation strategies:
- Collaborative Filtering: Based on similar users
- Content-Based: Based on item features
- Hybrid: Combination of both
Each strategy has an unknown conversion rate. A bandit algorithm helps decide which strategy to use for each user to maximize overall conversions.
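To make the setup concrete, here is a minimal simulation of this scenario. The conversion rates below are made up for illustration (the algorithm never sees them directly), and the selection rule is a simple explore/exploit mix:

```python
import random

# Hypothetical true conversion rates, unknown to the algorithm
true_rates = {"collaborative": 0.05, "content": 0.04, "hybrid": 0.06}

def simulate(n_users=10_000, epsilon=0.1, seed=42):
    rng = random.Random(seed)
    counts = {s: 0 for s in true_rates}     # times each strategy was used
    successes = {s: 0 for s in true_rates}  # conversions per strategy
    for _ in range(n_users):
        if rng.random() < epsilon:
            # Explore: try a random strategy
            strategy = rng.choice(list(true_rates))
        else:
            # Exploit: use the best observed conversion rate so far
            strategy = max(
                true_rates,
                key=lambda s: successes[s] / counts[s] if counts[s] else 0.0,
            )
        counts[strategy] += 1
        if rng.random() < true_rates[strategy]:  # simulated conversion
            successes[strategy] += 1
    return counts, successes
```

Over enough users, the best strategy accumulates most of the traffic while the others still receive occasional exploratory pulls.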
Types of Bandit Algorithms
1. Epsilon-Greedy
The simplest bandit algorithm that balances exploration and exploitation.
How it Works
```python
import random

def epsilon_greedy(arms, epsilon=0.1):
    if random.random() < epsilon:
        # Explore: choose a random arm
        return random.choice(arms)
    # Exploit: choose the arm with the best average reward so far
    return max(arms, key=lambda arm: arm.average_reward)
```
Characteristics
- Simple: Easy to understand and implement
- Tunable: Control exploration with epsilon parameter
- Predictable: Fixed exploration rate
- Suboptimal: Doesn't adapt exploration based on uncertainty
When to Use
- Getting started with bandit algorithms
- When you want predictable behavior
- Simple scenarios with few arms
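One common remedy for the fixed exploration rate is a decaying schedule, e.g. εₜ = min(1, c / t): heavy exploration early on, near-pure exploitation later. A minimal sketch (the schedule and the constant c are illustrative choices, not a standard API):

```python
import random

def decaying_epsilon_greedy(values, t, c=5.0):
    """Epsilon-greedy with epsilon_t = min(1, c / t).
    `values` holds each arm's average observed reward; returns an arm index."""
    epsilon_t = min(1.0, c / max(t, 1))
    if random.random() < epsilon_t:
        return random.randrange(len(values))  # explore: uniform random arm
    return max(range(len(values)), key=lambda i: values[i])  # exploit
```

Early on (small t) the agent behaves almost uniformly at random; as t grows, the exploration probability shrinks toward zero.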
2. Upper Confidence Bound (UCB)
UCB makes decisions using confidence intervals, treating uncertain arms optimistically ("optimism in the face of uncertainty").
How it Works
```python
from math import log, sqrt

def ucb(arm, time_step):
    if arm.pulls == 0:
        return float('inf')  # Ensure each arm is pulled at least once
    # Exploration bonus from the confidence interval
    confidence = sqrt(2 * log(time_step) / arm.pulls)
    return arm.average_reward + confidence
```
The UCB Formula
For each arm, UCB calculates:
UCB(i) = μᵢ + c × √(2 × ln(t) / nᵢ)
Where:
- μᵢ = average reward of arm i
- c = confidence parameter (c = 1 recovers the standard UCB1 rule used in the code above)
- t = total time steps so far
- nᵢ = number of times arm i has been pulled
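Plugging illustrative numbers into the formula, with c = 1:

```python
import math

def ucb_score(avg_reward, t, n, c=1.0):
    # UCB(i) = mu_i + c * sqrt(2 * ln(t) / n_i)
    return avg_reward + c * math.sqrt(2 * math.log(t) / n)

# An arm averaging 0.5 reward after 10 pulls out of 100 total time steps:
score = ucb_score(0.5, t=100, n=10)  # 0.5 + ~0.96 = ~1.46
```

The less an arm has been pulled relative to the total horizon, the larger its bonus, so neglected arms eventually get re-tried.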
Characteristics
- Optimistic: Assumes uncertain arms might be good
- Adaptive: Exploration decreases over time
- Theoretical Guarantees: Provable regret bounds
- Deterministic: No randomness in selection
When to Use
- When you want theoretical guarantees
- Scenarios with clear time horizons
- When you prefer deterministic behavior
3. Thompson Sampling
A Bayesian approach that samples from the posterior distribution of each arm's reward.
How it Works
```python
import numpy as np
from scipy.stats import beta

def thompson_sampling(arms):
    samples = []
    for arm in arms:
        # Sample from each arm's Beta posterior
        sample = beta.rvs(arm.alpha, arm.beta)
        samples.append(sample)
    # Choose the arm with the highest sampled value
    return arms[np.argmax(samples)]
```
Bayesian Framework
Thompson Sampling maintains a Beta distribution for each arm:
- Prior: Beta(α₀, β₀) - initial beliefs
- Likelihood: Bernoulli trials (success/failure)
- Posterior: Beta(α₀ + successes, β₀ + failures)
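A worked posterior update under this framework, with hypothetical success/failure counts; the standard-library `random.betavariate` stands in here for `beta.rvs`:

```python
import random

# Uniform prior Beta(1, 1); the counts below are hypothetical
alpha_0, beta_0 = 1, 1
successes, failures = 30, 70

# Posterior: Beta(alpha_0 + successes, beta_0 + failures)
alpha_post, beta_post = alpha_0 + successes, beta_0 + failures  # Beta(31, 71)

# Posterior mean: alpha / (alpha + beta)
posterior_mean = alpha_post / (alpha_post + beta_post)  # 31 / 102 ≈ 0.304

# One Thompson draw from the posterior
sample = random.betavariate(alpha_post, beta_post)
```

As evidence accumulates, the posterior concentrates around the empirical success rate, so draws for well-observed arms vary less and exploration naturally tapers off.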
Characteristics
- Bayesian: Incorporates prior knowledge
- Adaptive: Naturally balances exploration and exploitation
- Empirically Strong: Often performs best in practice
- Probabilistic: Uses sampling for decisions
When to Use
- Most practical scenarios
- When you have prior knowledge
- When you want adaptive exploration
- Complex, non-stationary environments
4. Contextual Bandits (LinUCB)
Contextual bandits use additional information (context) to make better decisions.
How it Works
```python
from math import sqrt

def linucb(context, arm, alpha=1.0):
    # Exploration bonus from the linear model's confidence interval
    confidence = alpha * sqrt(context.T @ arm.A_inv @ context)
    # Expected reward (theta^T x) plus the optimism bonus
    return arm.theta.T @ context + confidence
```
Contextual Features
Context can include:
- User demographics (age, location, income)
- Time information (hour, day, season)
- Device information (mobile, desktop, tablet)
- Session information (new user, returning user)
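One way to turn such raw attributes into a fixed-length feature vector; the specific features and scalings below are illustrative, not prescriptive:

```python
import numpy as np

def encode_context(user):
    """Encode a user/session dict into a feature vector for a linear bandit."""
    hour = user["hour"]
    return np.array([
        1.0,                                   # bias term
        user["age"] / 100.0,                   # scaled demographic
        1.0 if user["device"] == "mobile" else 0.0,
        1.0 if user["is_returning"] else 0.0,
        np.sin(2 * np.pi * hour / 24),         # cyclic time-of-day encoding
        np.cos(2 * np.pi * hour / 24),
    ])

x = encode_context({"age": 30, "device": "mobile",
                    "is_returning": True, "hour": 18})
```

The cyclic sine/cosine pair keeps 23:00 and 01:00 close in feature space, which a raw hour value would not.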
Characteristics
- Context-Aware: Uses additional information
- Linear Model: Assumes linear relationship between context and reward
- Scalable: Can handle high-dimensional contexts
- Personalized: Different decisions for different contexts
When to Use
- When you have rich contextual information
- Personalization scenarios
- High-dimensional feature spaces
- When linear assumptions are reasonable
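The selection snippet above omits the update step. A fuller sketch of a single LinUCB arm, maintaining the ridge-regression statistics A and b (class and method names here are ours, not a library API):

```python
import numpy as np

class LinUCBArm:
    """One arm's ridge-regression state for LinUCB (sketch)."""
    def __init__(self, n_features, alpha=1.0):
        self.alpha = alpha
        self.A = np.eye(n_features)    # accumulates X^T X + I
        self.b = np.zeros(n_features)  # accumulates X^T y

    def score(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b         # ridge estimate of the reward weights
        return theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

def linucb_select(arms, x):
    # Pick the arm with the highest optimistic score for this context
    return max(range(len(arms)), key=lambda i: arms[i].score(x))
```

Inverting A on every call is fine for a sketch; a production version would maintain A⁻¹ incrementally (e.g. via the Sherman-Morrison update).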
Regret and Performance Metrics
Cumulative Regret
Regret measures how much worse you did compared to the optimal strategy:
Regret(T) = Σₜ₌₁ᵀ [μ* − μᵢₜ]

Where:
- μ* = mean reward of the best arm
- μᵢₜ = mean reward of the arm chosen at time t
- T = total time horizon
Types of Regret
- Cumulative Regret: Total regret over time
- Average Regret: Regret per time step
- Pseudo-Regret: Expected regret (theoretical)
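Cumulative regret is straightforward to compute once the arm means are known (which, of course, they only are in simulation):

```python
def cumulative_regret(chosen_means, best_mean):
    """Regret(T) = sum over t of (mu* - mu_{a_t}),
    given the mean reward of the arm chosen at each step."""
    return sum(best_mean - mu for mu in chosen_means)

# Best arm pays 0.6 on average; we played arms with means 0.6, 0.4, 0.6, 0.5
regret = cumulative_regret([0.6, 0.4, 0.6, 0.5], best_mean=0.6)  # 0.3
```

Note that regret is zero on the steps where the best arm was played; only suboptimal choices accumulate regret.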
Regret Bounds
Different algorithms have different theoretical guarantees:
- Epsilon-Greedy: linear O(T) regret with fixed ε; O(T^(2/3)) with a suitably decaying schedule
- UCB: O(√(T log T)) regret
- Thompson Sampling: O(√(T log T)) regret
- LinUCB: O(√(T log T)) regret
Practical Considerations
1. Non-Stationary Environments
In real-world scenarios, arm rewards can change over time:
```python
def adaptive_bandit(arms, decay_factor=0.99):
    # Exponentially discount old observations so that recent
    # rewards dominate when the environment drifts
    for arm in arms:
        arm.rewards *= decay_factor
        arm.pulls *= decay_factor
```
2. Delayed Feedback
Sometimes feedback arrives with delay:
```python
pending_feedback = {}

def record_selection(user_id, selected_arm, current_time, user_context):
    # Store the decision until its outcome is observed
    pending_feedback[user_id] = {
        'arm': selected_arm,
        'timestamp': current_time,
        'context': user_context,
    }

def process_feedback(user_id, reward):
    # Apply the delayed feedback once it arrives
    feedback = pending_feedback.pop(user_id)
    bandit.update(feedback['arm'], reward)
```
3. Multiple Objectives
Sometimes you want to optimize multiple metrics:
```python
def multi_objective_reward(conversion_rate, revenue, user_satisfaction,
                           w1=0.5, w2=0.3, w3=0.2):
    # Weighted combination of objectives; the weights are a product decision
    return (
        w1 * conversion_rate +
        w2 * revenue +
        w3 * user_satisfaction
    )
```
Implementation Best Practices
1. Start Simple
Begin with epsilon-greedy and move to more sophisticated algorithms:
```python
# Start with epsilon-greedy
bandit = EpsilonGreedyBandit(epsilon=0.1)

# Move to Thompson Sampling
bandit = ThompsonSamplingBandit()

# Add context if available
bandit = LinUCBBandit(n_features=10)
```
2. Monitor Performance
Track key metrics:
```python
def monitor_bandit(bandit):
    metrics = {
        'cumulative_reward': bandit.total_reward,
        'average_reward': bandit.total_reward / bandit.total_pulls,
        'exploration_rate': bandit.exploration_pulls / bandit.total_pulls,
        'arm_distribution': bandit.arm_counts,
    }
    return metrics
```
3. Handle Edge Cases
```python
import logging
import random

def robust_bandit_selection(bandit, context=None):
    try:
        if context is not None:
            return bandit.select_arm(context=context)
        return bandit.select_arm()
    except Exception:
        # Fall back to uniform random selection rather than failing the request
        logging.exception("Bandit selection failed; falling back to random")
        return random.choice(bandit.arms)
```
4. A/B Testing Integration
```python
def ab_test_with_bandit(test_name, variants):
    bandit = ThompsonSamplingBandit(n_arms=len(variants))

    def select_variant(user_id, context=None):
        arm_index = bandit.select_arm(context=context)
        return variants[arm_index], arm_index

    def update_feedback(arm_index, reward, context=None):
        bandit.update(arm_index, reward, context=context)

    return select_variant, update_feedback
```
Common Pitfalls
1. Insufficient Exploration
```python
# Bad: too little exploration
bandit = EpsilonGreedyBandit(epsilon=0.01)  # only 1% exploration

# Good: balanced exploration
bandit = EpsilonGreedyBandit(epsilon=0.1)   # 10% exploration
```
2. Ignoring Context
```python
# Bad: ignoring available context
def select_arm(user_context):
    return bandit.select_arm()  # context unused

# Good: using context
def select_arm(user_context):
    return bandit.select_arm(context=user_context)
```
3. Not Handling Non-Stationarity
```python
# Bad: never adapting to change
bandit = ThompsonSamplingBandit()
# (priors never updated, old data never decayed)

# Good: decaying old observations
bandit = ThompsonSamplingBandit()
bandit.decay_factor = 0.99
```
4. Poor Reward Design
```python
# Bad: binary rewards only
reward = 1 if user_clicked else 0

# Good: richer reward signal across the funnel
reward = (
    0.1 * user_viewed +
    0.3 * user_clicked +
    0.7 * user_added_to_cart +
    1.0 * user_purchased
)
```
Advanced Topics
1. Hierarchical Bandits
For complex scenarios with multiple levels of decisions:
```python
class HierarchicalBandit:
    def __init__(self, categories):
        self.categories = categories
        self.category_bandit = ThompsonSamplingBandit(n_arms=len(categories))
        self.product_bandits = {
            category: ThompsonSamplingBandit(n_arms=10)
            for category in categories
        }

    def select_product(self, user_context):
        # First select a category...
        index = self.category_bandit.select_arm(context=user_context)
        category = self.categories[index]
        # ...then select a product within that category
        product = self.product_bandits[category].select_arm(context=user_context)
        return category, product
```
2. Combinatorial Bandits
When you need to select multiple items simultaneously:
```python
class CombinatorialBandit:
    def __init__(self, n_items, k_recommendations):
        self.n_items = n_items
        self.k_recommendations = k_recommendations
        self.item_bandits = [ThompsonSamplingBandit() for _ in range(n_items)]

    def select_items(self, user_context):
        # Draw a posterior score for every item
        # (assumes each per-item bandit exposes a sample_score method)
        scores = []
        for i, bandit in enumerate(self.item_bandits):
            score = bandit.sample_score(context=user_context)
            scores.append((i, score))
        # Recommend the top-k items by sampled score
        top_items = sorted(scores, key=lambda s: s[1], reverse=True)
        return [i for i, _ in top_items[:self.k_recommendations]]
```
3. Federated Bandits
For distributed scenarios where data is spread across multiple sources:
```python
class FederatedBandit:
    def __init__(self, n_clients):
        self.client_bandits = [ThompsonSamplingBandit() for _ in range(n_clients)]
        self.global_bandit = ThompsonSamplingBandit()

    def select_arm(self, client_id, context):
        # Local bandit captures client-specific behaviour
        local_arm = self.client_bandits[client_id].select_arm(context=context)
        # Global bandit pools information across all clients
        global_arm = self.global_bandit.select_arm(context=context)
        # combine_arms (application-defined) merges the two suggestions
        return self.combine_arms(local_arm, global_arm)
```
Conclusion
Bandit algorithms provide a principled approach to solving the exploration-exploitation trade-off in recommendation systems. By choosing the right algorithm and implementing it correctly, you can significantly improve the performance of your recommendation system while continuously learning and adapting to user preferences.
The key is to start simple, monitor performance, and gradually add complexity as needed. With proper implementation, bandit algorithms can help you build more intelligent, adaptive recommendation systems that deliver better results for both users and businesses.