
Bandit Algorithms: Exploration vs Exploitation

This document explains the fundamental concepts behind bandit algorithms and how they solve the exploration-exploitation trade-off in recommendation systems.

The Multi-Armed Bandit Problem

The multi-armed bandit problem is a classic problem in probability theory and machine learning. Imagine you're in a casino with multiple slot machines (arms), each with an unknown probability of paying out. Your goal is to maximize your total reward over time by deciding which machines to play.

The Core Challenge

The fundamental challenge is the exploration-exploitation trade-off:

  • Exploitation: Play the machine that has given you the best results so far
  • Exploration: Try other machines to discover if they might be better

Too much exploitation means you might miss better opportunities. Too much exploration means you're not capitalizing on what you've already learned.
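
This tension is easy to see in a tiny simulation (a sketch with made-up Bernoulli payout rates, unknown to the player): a policy that only exploits tends to lock onto the first arm it tries, while a policy that explores occasionally finds the best arm.

```python
import random

random.seed(0)
TRUE_RATES = [0.3, 0.5, 0.7]  # hidden payout probabilities

def pull(arm):
    """Simulate one play of a slot machine."""
    return 1 if random.random() < TRUE_RATES[arm] else 0

def run(epsilon, steps=5000):
    """Play for `steps` rounds, exploring with probability epsilon."""
    counts = [0, 0, 0]
    rewards = [0.0, 0.0, 0.0]
    total = 0
    for _ in range(steps):
        if random.random() < epsilon:
            arm = random.randrange(3)  # explore: random arm
        else:
            means = [rewards[i] / counts[i] if counts[i] else 0.0
                     for i in range(3)]
            arm = max(range(3), key=lambda i: means[i])  # exploit
        r = pull(arm)
        counts[arm] += 1
        rewards[arm] += r
        total += r
    return total / steps

print(run(epsilon=0.0))  # pure exploitation: tends to lock onto arm 0
print(run(epsilon=0.1))  # some exploration: tends toward the 0.7 arm
```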

Bandit Algorithms in Recommendation Systems

In recommendation systems, bandit algorithms help solve similar problems:

  • Which recommendation algorithm should I use for this user?
  • Which product should I recommend to maximize conversion?
  • Which ad should I show to maximize click-through rate?

Real-World Example

Consider an e-commerce site testing three recommendation strategies:

  1. Collaborative Filtering: Based on similar users
  2. Content-Based: Based on item features
  3. Hybrid: Combination of both

Each strategy has an unknown conversion rate. A bandit algorithm helps decide which strategy to use for each user to maximize overall conversions.

Types of Bandit Algorithms

1. Epsilon-Greedy

The simplest bandit algorithm: with probability ε it explores a random arm, otherwise it exploits the best arm found so far.

How it Works

def epsilon_greedy(epsilon=0.1):
    if random.random() < epsilon:
        # Explore: choose a random arm
        return random.choice(arms)
    else:
        # Exploit: choose the best arm so far
        return best_arm_so_far()

Characteristics

  • Simple: Easy to understand and implement
  • Tunable: Control exploration with epsilon parameter
  • Predictable: Fixed exploration rate
  • Suboptimal: Doesn't adapt exploration based on uncertainty

When to Use

  • Getting started with bandit algorithms
  • When you want predictable behavior
  • Simple scenarios with few arms
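
For reference, the fragment above can be fleshed out into a small self-contained class whose `select_arm` / `update` API mirrors the hypothetical `EpsilonGreedyBandit` used in later snippets of this document:

```python
import random

class EpsilonGreedyBandit:
    """Minimal epsilon-greedy bandit over n_arms arms."""

    def __init__(self, n_arms, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = [0] * n_arms          # pulls per arm
        self.values = [0.0] * n_arms        # running mean reward per arm

    def select_arm(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.counts))  # explore
        # exploit: arm with the highest running mean
        return max(range(len(self.counts)), key=lambda i: self.values[i])

    def update(self, arm, reward):
        self.counts[arm] += 1
        # incremental mean: new_mean = old_mean + (r - old_mean) / n
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```

With ε = 0, `select_arm` reduces to a greedy argmax over the running means, which makes the class easy to unit-test.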

2. Upper Confidence Bound (UCB)

UCB uses confidence intervals to make decisions, being optimistic about uncertain arms.

How it Works

def ucb(arm, time_step):
    if arm.pulls == 0:
        return float('inf')  # Pull each arm once

    # Calculate the confidence bonus
    confidence = sqrt(2 * log(time_step) / arm.pulls)
    return arm.average_reward + confidence

The UCB Formula

For each arm, UCB calculates:

UCB(i) = μᵢ + c × √(2 × ln(t) / nᵢ)

Where:

  • μᵢ = average reward of arm i
  • c = confidence parameter (c = 1 recovers standard UCB1, since the factor 2 is already inside the square root; larger c explores more)
  • t = total time steps
  • nᵢ = number of times arm i has been pulled

Characteristics

  • Optimistic: Assumes uncertain arms might be good
  • Adaptive: Exploration decreases over time
  • Theoretical Guarantees: Provable regret bounds
  • Deterministic: No randomness in selection

When to Use

  • When you want theoretical guarantees
  • Scenarios with clear time horizons
  • When you prefer deterministic behavior
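
Putting the pieces together, a minimal UCB1 loop might look like this (a sketch with invented Bernoulli arms; with the 2 inside the square root, c = 1 here):

```python
import math
import random

random.seed(42)
TRUE_RATES = [0.2, 0.5, 0.8]  # unknown to the algorithm

counts = [0] * 3   # n_i: pulls per arm
means = [0.0] * 3  # mu_i: running mean reward per arm

def ucb_score(i, t):
    if counts[i] == 0:
        return float("inf")  # force one pull of every arm first
    # optimistic estimate: mean + confidence width
    return means[i] + math.sqrt(2 * math.log(t) / counts[i])

total = 0
for t in range(1, 2001):
    arm = max(range(3), key=lambda i: ucb_score(i, t))
    reward = 1 if random.random() < TRUE_RATES[arm] else 0
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean
    total += reward

print(counts)  # the best arm should dominate the pull counts
```

Note that the selection is fully deterministic given the reward history: the only randomness comes from the arms themselves.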

3. Thompson Sampling

A Bayesian approach that samples from the posterior distribution of each arm's reward.

How it Works

def thompson_sampling():
    samples = []
    for arm in arms:
        # Sample from the arm's Beta posterior
        sample = beta.rvs(arm.alpha, arm.beta)
        samples.append(sample)

    # Choose the arm with the highest sample
    return arms[np.argmax(samples)]

Bayesian Framework

Thompson Sampling maintains a Beta distribution for each arm:

  • Prior: Beta(α₀, β₀) - initial beliefs
  • Likelihood: Bernoulli trials (success/failure)
  • Posterior: Beta(α₀ + successes, β₀ + failures)

Characteristics

  • Bayesian: Incorporates prior knowledge
  • Adaptive: Naturally balances exploration and exploitation
  • Empirically Strong: Often performs best in practice
  • Probabilistic: Uses sampling for decisions

When to Use

  • Most practical scenarios
  • When you have prior knowledge
  • When you want adaptive exploration
  • Complex, non-stationary environments
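
The Beta-Bernoulli scheme above can be run end-to-end with only the standard library (a sketch with invented reward rates; `random.betavariate` stands in for the scipy-style `beta.rvs` used earlier):

```python
import random

random.seed(7)
TRUE_RATES = [0.2, 0.5, 0.8]  # hidden success probabilities

# Beta(1, 1) uniform priors; alpha tracks successes, beta tracks failures
alphas = [1.0] * 3
betas = [1.0] * 3

total = 0
for _ in range(2000):
    # Sample one plausible reward rate per arm from its posterior
    samples = [random.betavariate(alphas[i], betas[i]) for i in range(3)]
    arm = max(range(3), key=lambda i: samples[i])
    reward = 1 if random.random() < TRUE_RATES[arm] else 0
    # Bernoulli likelihood -> conjugate Beta posterior update
    alphas[arm] += reward
    betas[arm] += 1 - reward
    total += reward

print(total / 2000)  # should approach the best arm's rate, 0.8
```

As an arm's posterior concentrates, its samples stop straying far from its mean, so exploration fades naturally without any explicit schedule.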

4. Contextual Bandits (LinUCB)

Contextual bandits use additional information (context) to make better decisions.

How it Works

def linucb(context, arm):
    # Confidence width of the linear model's prediction for this context
    confidence = alpha * sqrt(context.T @ arm.A_inv @ context)
    return arm.theta.T @ context + confidence

Contextual Features

Context can include:

  • User demographics (age, location, income)
  • Time information (hour, day, season)
  • Device information (mobile, desktop, tablet)
  • Session information (new user, returning user)

Characteristics

  • Context-Aware: Uses additional information
  • Linear Model: Assumes linear relationship between context and reward
  • Scalable: Can handle high-dimensional contexts
  • Personalized: Different decisions for different contexts

When to Use

  • When you have rich contextual information
  • Personalization scenarios
  • High-dimensional feature spaces
  • When linear assumptions are reasonable
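
The scoring rule above extends to a minimal disjoint LinUCB (a sketch: one ridge-regression model per arm, reusing the `A_inv` / `theta` names from the snippet; the ground-truth weights and noise level are invented for the demo):

```python
import numpy as np

class LinUCBArm:
    def __init__(self, n_features, alpha=1.0):
        self.alpha = alpha
        self.A = np.eye(n_features)    # ridge Gram matrix: X^T X + I
        self.b = np.zeros(n_features)  # X^T y

    def score(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b         # ridge-regression estimate
        # optimistic score: predicted reward + confidence width
        return theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

rng = np.random.default_rng(0)
arms = [LinUCBArm(n_features=2) for _ in range(2)]
# invented ground truth: arm 0 rewards feature 0, arm 1 rewards feature 1
true_theta = [np.array([0.9, 0.1]), np.array([0.1, 0.9])]

hits = 0
for _ in range(500):
    x = rng.random(2)                                   # context
    chosen = max(range(2), key=lambda i: arms[i].score(x))
    reward = float(true_theta[chosen] @ x + 0.05 * rng.standard_normal())
    arms[chosen].update(x, reward)
    hits += chosen == int(x[1] > x[0])  # was the better arm picked?

print(hits / 500)  # fraction of contexts where the better arm was chosen
```

A production version would update `A_inv` incrementally (Sherman-Morrison) rather than re-inverting on every score.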

Regret and Performance Metrics

Cumulative Regret

Regret measures how much worse you did compared to the optimal strategy:

Regret(T) = Σₜ₌₁ᵀ [μ* − μᵢₜ]

Where:

  • μ* = reward of the best arm
  • μᵢₜ = expected reward of the arm chosen at time t
  • T = total time horizon

Types of Regret

  1. Cumulative Regret: Total regret over time
  2. Average Regret: Regret per time step
  3. Pseudo-Regret: Expected regret (theoretical)
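
The definition translates directly into code (a sketch: the evaluator knows the true means, and the list of chosen arms is a hypothetical log of a policy's decisions):

```python
# True mean rewards (known only to the evaluator, never to the algorithm)
true_means = [0.3, 0.5, 0.7]
best_mean = max(true_means)  # mu*

# Hypothetical log of which arm the policy chose at each step t
chosen_arms = [0, 1, 2, 2, 1, 2, 2, 2]

# Cumulative regret: sum over t of (mu* - mu_{i_t})
cumulative_regret = sum(best_mean - true_means[arm] for arm in chosen_arms)
average_regret = cumulative_regret / len(chosen_arms)

print(round(cumulative_regret, 10))  # 0.8
print(round(average_regret, 10))     # 0.1
```

Note that regret is a counterfactual quantity: it needs the true means, so in practice it can only be computed in simulations or estimated offline.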

Regret Bounds

Different algorithms have different theoretical guarantees:

  • Epsilon-Greedy: O(T) regret with fixed ε; O(T^(2/3)) with a properly decayed ε
  • UCB: O(√(T log T)) regret
  • Thompson Sampling: O(√(T log T)) regret
  • LinUCB: Õ(d√T) regret, where d is the context dimension

Practical Considerations

1. Non-Stationary Environments

In real-world scenarios, arm rewards can change over time:

def adaptive_bandit():
    # Down-weight old observations so recent rewards dominate
    for arm in arms:
        arm.rewards *= decay_factor
        arm.pulls *= decay_factor
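
A related technique is a constant-step-size (exponentially weighted) estimate, which forgets old data without touching stored counts (a sketch with an assumed fixed step size):

```python
class DiscountedArm:
    """Tracks a reward estimate that forgets old observations."""

    def __init__(self, step_size=0.5):
        self.step_size = step_size  # fixed step size -> exponential forgetting
        self.estimate = 0.0

    def update(self, reward):
        # constant-step-size update: the estimate chases recent rewards
        self.estimate += self.step_size * (reward - self.estimate)

arm = DiscountedArm(step_size=0.5)
for r in [1, 1, 1, 1]:
    arm.update(r)
before_shift = arm.estimate   # close to 1 after a run of successes
for r in [0, 0, 0, 0]:
    arm.update(r)
print(before_shift, arm.estimate)  # estimate drops fast once the arm degrades
```

A sample-average estimate would take far longer to react, since each new observation is diluted by the entire history.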

2. Delayed Feedback

Sometimes feedback arrives with delay:

def delayed_feedback():
    # Store pending feedback keyed by user
    pending_feedback[user_id] = {
        'arm': selected_arm,
        'timestamp': current_time,
        'context': user_context
    }

# Process delayed feedback when it arrives
def process_feedback(user_id, reward):
    feedback = pending_feedback.pop(user_id)
    bandit.update(feedback['arm'], reward)

3. Multiple Objectives

Sometimes you want to optimize multiple metrics:

def multi_objective_bandit():
    # Weighted combination of objectives
    combined_reward = (
        w1 * conversion_rate +
        w2 * revenue +
        w3 * user_satisfaction
    )
    return combined_reward

Implementation Best Practices

1. Start Simple

Begin with epsilon-greedy and move to more sophisticated algorithms:

# Start with epsilon-greedy
bandit = EpsilonGreedyBandit(epsilon=0.1)

# Move to Thompson Sampling
bandit = ThompsonSamplingBandit()

# Add context if available
bandit = LinUCBBandit(n_features=10)

2. Monitor Performance

Track key metrics:

def monitor_bandit(bandit):
    metrics = {
        'cumulative_reward': bandit.total_reward,
        'average_reward': bandit.total_reward / bandit.total_pulls,
        'exploration_rate': bandit.exploration_pulls / bandit.total_pulls,
        'arm_distribution': bandit.arm_counts
    }
    return metrics

3. Handle Edge Cases

def robust_bandit_selection(bandit, context=None):
    try:
        if context is not None:
            return bandit.select_arm(context=context)
        return bandit.select_arm()
    except Exception:
        # Fall back to a random arm if selection fails
        return random.choice(bandit.arms)

4. A/B Testing Integration

def ab_test_with_bandit(test_name, variants):
    bandit = ThompsonSamplingBandit(n_arms=len(variants))

    def select_variant(user_id, context=None):
        arm_index = bandit.select_arm(context=context)
        return variants[arm_index], arm_index

    def update_feedback(arm_index, reward, context=None):
        bandit.update(arm_index, reward, context=context)

    return select_variant, update_feedback

Common Pitfalls

1. Insufficient Exploration

# Bad: Too little exploration
bandit = EpsilonGreedyBandit(epsilon=0.01) # Only 1% exploration

# Good: Balanced exploration
bandit = EpsilonGreedyBandit(epsilon=0.1) # 10% exploration

2. Ignoring Context

# Bad: Ignoring available context
def select_arm(user_context):
    return bandit.select_arm()  # Not using context

# Good: Using context
def select_arm(user_context):
    return bandit.select_arm(context=user_context)

3. Not Handling Non-Stationarity

# Bad: Not adapting to changes
bandit = ThompsonSamplingBandit()
# Never update priors or decay old data

# Good: Adapting to changes
bandit = ThompsonSamplingBandit()
bandit.decay_factor = 0.99 # Decay old observations

4. Poor Reward Design

# Bad: Binary rewards only
reward = 1 if user_clicked else 0

# Good: Rich reward signal capturing the whole funnel
reward = (
    0.1 * user_viewed +
    0.3 * user_clicked +
    0.7 * user_added_to_cart +
    1.0 * user_purchased
)

Advanced Topics

1. Hierarchical Bandits

For complex scenarios with multiple levels of decisions:

class HierarchicalBandit:
    def __init__(self):
        self.category_bandit = ThompsonSamplingBandit(n_arms=5)
        self.product_bandits = {
            category: ThompsonSamplingBandit(n_arms=10)
            for category in categories
        }

    def select_product(self, user_context):
        # First select a category
        category = self.category_bandit.select_arm(context=user_context)

        # Then select a product within that category
        product = self.product_bandits[category].select_arm(context=user_context)

        return category, product

2. Combinatorial Bandits

When you need to select multiple items simultaneously:

class CombinatorialBandit:
    def __init__(self, n_items, k_recommendations):
        self.n_items = n_items
        self.k_recommendations = k_recommendations
        self.item_bandits = [ThompsonSamplingBandit() for _ in range(n_items)]

    def select_items(self, user_context):
        # Sample a score for every item from its bandit's posterior
        # (select_arm would return an arm index, not a score, so a
        # separate sampling method is assumed here)
        scores = []
        for i, bandit in enumerate(self.item_bandits):
            score = bandit.sample_score(context=user_context)
            scores.append((i, score))

        # Recommend the k highest-scoring items
        top_items = sorted(scores, key=lambda x: x[1], reverse=True)[:self.k_recommendations]
        return [item[0] for item in top_items]

3. Federated Bandits

For distributed scenarios where data is spread across multiple sources:

class FederatedBandit:
    def __init__(self, n_clients):
        self.client_bandits = [ThompsonSamplingBandit() for _ in range(n_clients)]
        self.global_bandit = ThompsonSamplingBandit()

    def select_arm(self, client_id, context):
        # Use the local bandit for client-specific decisions
        local_arm = self.client_bandits[client_id].select_arm(context=context)

        # Also consult global information
        global_arm = self.global_bandit.select_arm(context=context)

        # Combine local and global choices (combination strategy left abstract)
        return self.combine_arms(local_arm, global_arm)

Conclusion

Bandit algorithms provide a principled approach to solving the exploration-exploitation trade-off in recommendation systems. By choosing the right algorithm and implementing it correctly, you can significantly improve the performance of your recommendation system while continuously learning and adapting to user preferences.

The key is to start simple, monitor performance, and gradually add complexity as needed. With proper implementation, bandit algorithms can help you build more intelligent, adaptive recommendation systems that deliver better results for both users and businesses.