
Bandit Algorithms: Exploration vs Exploitation

This document explains the fundamental concepts behind bandit algorithms and how they solve the exploration-exploitation trade-off in recommendation systems.

The Multi-Armed Bandit Problem

The multi-armed bandit problem is a classic problem in probability theory and machine learning. Imagine you're in a casino with multiple slot machines (arms), each with an unknown probability of paying out. Your goal is to maximize your total reward over time by deciding which machines to play.

The Core Challenge

The fundamental challenge is the exploration-exploitation trade-off:

  • Exploitation: Play the machine that has given you the best results so far
  • Exploration: Try other machines to discover if they might be better

Too much exploitation means you might miss better opportunities. Too much exploration means you're not capitalizing on what you've already learned.
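
This tension is easy to see in a tiny simulation (a sketch with made-up Bernoulli payout rates, unknown to the player): a policy that only exploits tends to lock onto the first arm it tries, while a policy that explores occasionally finds the best arm.

```python
import random

random.seed(0)
TRUE_RATES = [0.3, 0.5, 0.7]  # hidden payout probabilities

def pull(arm):
    """Simulate one play of a slot machine."""
    return 1 if random.random() < TRUE_RATES[arm] else 0

def run(epsilon, steps=5000):
    """Play for `steps` rounds, exploring with probability epsilon."""
    counts = [0, 0, 0]
    rewards = [0.0, 0.0, 0.0]
    total = 0
    for _ in range(steps):
        if random.random() < epsilon:
            arm = random.randrange(3)  # explore: random arm
        else:
            means = [rewards[i] / counts[i] if counts[i] else 0.0
                     for i in range(3)]
            arm = max(range(3), key=lambda i: means[i])  # exploit
        r = pull(arm)
        counts[arm] += 1
        rewards[arm] += r
        total += r
    return total / steps

print(run(epsilon=0.0))  # pure exploitation: tends to lock onto arm 0
print(run(epsilon=0.1))  # some exploration: tends toward the 0.7 arm
```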

Bandit Algorithms in Recommendation Systems

In recommendation systems, bandit algorithms help solve similar problems:

  • Which recommendation algorithm should I use for this user?
  • Which product should I recommend to maximize conversion?
  • Which ad should I show to maximize click-through rate?

Real-World Example

Consider an e-commerce site testing three recommendation strategies:

  1. Collaborative Filtering: Based on similar users
  2. Content-Based: Based on item features
  3. Hybrid: Combination of both

Each strategy has an unknown conversion rate. A bandit algorithm helps decide which strategy to use for each user to maximize overall conversions.

Types of Bandit Algorithms

1. Epsilon-Greedy

The simplest bandit algorithm: with probability ε it explores a random arm, otherwise it exploits the best arm found so far.

How it Works

def epsilon_greedy(epsilon=0.1):
    if random.random() < epsilon:
        # Explore: choose a random arm
        return random.choice(arms)
    else:
        # Exploit: choose the best arm so far
        return best_arm_so_far()

Characteristics

  • Simple: Easy to understand and implement
  • Tunable: Control exploration with epsilon parameter
  • Predictable: Fixed exploration rate
  • Suboptimal: Doesn't adapt exploration based on uncertainty

When to Use

  • Getting started with bandit algorithms
  • When you want predictable behavior
  • Simple scenarios with few arms
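
For reference, the fragment above can be fleshed out into a small self-contained class whose `select_arm` / `update` API mirrors the hypothetical `EpsilonGreedyBandit` used in later snippets of this document:

```python
import random

class EpsilonGreedyBandit:
    """Minimal epsilon-greedy bandit over n_arms arms."""

    def __init__(self, n_arms, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = [0] * n_arms          # pulls per arm
        self.values = [0.0] * n_arms        # running mean reward per arm

    def select_arm(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.counts))  # explore
        # exploit: arm with the highest running mean
        return max(range(len(self.counts)), key=lambda i: self.values[i])

    def update(self, arm, reward):
        self.counts[arm] += 1
        # incremental mean: new_mean = old_mean + (r - old_mean) / n
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```

With ε = 0, `select_arm` reduces to a greedy argmax over the running means, which makes the class easy to unit-test.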

2. Upper Confidence Bound (UCB)

UCB uses confidence intervals to make decisions, being optimistic about uncertain arms.

How it Works

def ucb(arm, time_step):
    if arm.pulls == 0:
        return float('inf')  # Pull each arm once

    # Calculate the confidence bonus
    confidence = sqrt(2 * log(time_step) / arm.pulls)
    return arm.average_reward + confidence

The UCB Formula

For each arm, UCB calculates:

UCB(i) = μᵢ + c × √(2 × ln(t) / nᵢ)

Where:

  • μᵢ = average reward of arm i
  • c = confidence parameter (c = 1 recovers standard UCB1, since the factor 2 is already inside the square root; larger c explores more)
  • t = total time steps
  • nᵢ = number of times arm i has been pulled

Characteristics

  • Optimistic: Assumes uncertain arms might be good
  • Adaptive: Exploration decreases over time
  • Theoretical Guarantees: Provable regret bounds
  • Deterministic: No randomness in selection

When to Use

  • When you want theoretical guarantees
  • Scenarios with clear time horizons
  • When you prefer deterministic behavior
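
Putting the pieces together, a minimal UCB1 loop might look like this (a sketch with invented Bernoulli arms; with the 2 inside the square root, c = 1 here):

```python
import math
import random

random.seed(42)
TRUE_RATES = [0.2, 0.5, 0.8]  # unknown to the algorithm

counts = [0] * 3   # n_i: pulls per arm
means = [0.0] * 3  # mu_i: running mean reward per arm

def ucb_score(i, t):
    if counts[i] == 0:
        return float("inf")  # force one pull of every arm first
    # optimistic estimate: mean + confidence width
    return means[i] + math.sqrt(2 * math.log(t) / counts[i])

total = 0
for t in range(1, 2001):
    arm = max(range(3), key=lambda i: ucb_score(i, t))
    reward = 1 if random.random() < TRUE_RATES[arm] else 0
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean
    total += reward

print(counts)  # the best arm should dominate the pull counts
```

Note that the selection is fully deterministic given the reward history: the only randomness comes from the arms themselves.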

3. Thompson Sampling

A Bayesian approach that samples from the posterior distribution of each arm's reward.

How it Works

def thompson_sampling():
    samples = []
    for arm in arms:
        # Sample from the arm's Beta posterior
        sample = beta.rvs(arm.alpha, arm.beta)
        samples.append(sample)

    # Choose the arm with the highest sample
    return arms[np.argmax(samples)]

Bayesian Framework

Thompson Sampling maintains a Beta distribution for each arm:

  • Prior: Beta(α₀, β₀) - initial beliefs
  • Likelihood: Bernoulli trials (success/failure)
  • Posterior: Beta(α₀ + successes, β₀ + failures)

Characteristics

  • Bayesian: Incorporates prior knowledge
  • Adaptive: Naturally balances exploration and exploitation
  • Empirically Strong: Often performs best in practice
  • Probabilistic: Uses sampling for decisions

When to Use

  • Most practical scenarios
  • When you have prior knowledge
  • When you want adaptive exploration
  • Complex, non-stationary environments
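
The Beta-Bernoulli scheme above can be run end-to-end with only the standard library (a sketch with invented reward rates; `random.betavariate` stands in for the scipy-style `beta.rvs` used earlier):

```python
import random

random.seed(7)
TRUE_RATES = [0.2, 0.5, 0.8]  # hidden success probabilities

# Beta(1, 1) uniform priors; alpha tracks successes, beta tracks failures
alphas = [1.0] * 3
betas = [1.0] * 3

total = 0
for _ in range(2000):
    # Sample one plausible reward rate per arm from its posterior
    samples = [random.betavariate(alphas[i], betas[i]) for i in range(3)]
    arm = max(range(3), key=lambda i: samples[i])
    reward = 1 if random.random() < TRUE_RATES[arm] else 0
    # Bernoulli likelihood -> conjugate Beta posterior update
    alphas[arm] += reward
    betas[arm] += 1 - reward
    total += reward

print(total / 2000)  # should approach the best arm's rate, 0.8
```

As an arm's posterior concentrates, its samples stop straying far from its mean, so exploration fades naturally without any explicit schedule.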

4. Contextual Bandits (LinUCB)

Contextual bandits use additional information (context) to make better decisions.

How it Works

def linucb(context, arm):
    # Confidence width of the linear model's prediction for this context
    confidence = alpha * sqrt(context.T @ arm.A_inv @ context)
    return arm.theta.T @ context + confidence

Contextual Features

Context can include:

  • User demographics (age, location, income)
  • Time information (hour, day, season)
  • Device information (mobile, desktop, tablet)
  • Session information (new user, returning user)

Characteristics

  • Context-Aware: Uses additional information
  • Linear Model: Assumes linear relationship between context and reward
  • Scalable: Can handle high-dimensional contexts
  • Personalized: Different decisions for different contexts

When to Use

  • When you have rich contextual information
  • Personalization scenarios
  • High-dimensional feature spaces
  • When linear assumptions are reasonable
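
The scoring rule above extends to a minimal disjoint LinUCB (a sketch: one ridge-regression model per arm, reusing the `A_inv` / `theta` names from the snippet; the ground-truth weights and noise level are invented for the demo):

```python
import numpy as np

class LinUCBArm:
    def __init__(self, n_features, alpha=1.0):
        self.alpha = alpha
        self.A = np.eye(n_features)    # ridge Gram matrix: X^T X + I
        self.b = np.zeros(n_features)  # X^T y

    def score(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b         # ridge-regression estimate
        # optimistic score: predicted reward + confidence width
        return theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

rng = np.random.default_rng(0)
arms = [LinUCBArm(n_features=2) for _ in range(2)]
# invented ground truth: arm 0 rewards feature 0, arm 1 rewards feature 1
true_theta = [np.array([0.9, 0.1]), np.array([0.1, 0.9])]

hits = 0
for _ in range(500):
    x = rng.random(2)                                   # context
    chosen = max(range(2), key=lambda i: arms[i].score(x))
    reward = float(true_theta[chosen] @ x + 0.05 * rng.standard_normal())
    arms[chosen].update(x, reward)
    hits += chosen == int(x[1] > x[0])  # was the better arm picked?

print(hits / 500)  # fraction of contexts where the better arm was chosen
```

A production version would update `A_inv` incrementally (Sherman-Morrison) rather than re-inverting on every score.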

Regret and Performance Metrics

Cumulative Regret

Regret measures how much worse you did compared to the optimal strategy:

Regret(T) = Σₜ₌₁ᵀ [μ* − μᵢₜ]

Where:

  • μ* = reward of the best arm
  • μᵢₜ = expected reward of the arm chosen at time t
  • T = total time horizon

Types of Regret

  1. Cumulative Regret: Total regret over time
  2. Average Regret: Regret per time step
  3. Pseudo-Regret: Expected regret (theoretical)
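
The definition translates directly into code (a sketch: the evaluator knows the true means, and the list of chosen arms is a hypothetical log of a policy's decisions):

```python
# True mean rewards (known only to the evaluator, never to the algorithm)
true_means = [0.3, 0.5, 0.7]
best_mean = max(true_means)  # mu*

# Hypothetical log of which arm the policy chose at each step t
chosen_arms = [0, 1, 2, 2, 1, 2, 2, 2]

# Cumulative regret: sum over t of (mu* - mu_{i_t})
cumulative_regret = sum(best_mean - true_means[arm] for arm in chosen_arms)
average_regret = cumulative_regret / len(chosen_arms)

print(round(cumulative_regret, 10))  # 0.8
print(round(average_regret, 10))     # 0.1
```

Note that regret is a counterfactual quantity: it needs the true means, so in practice it can only be computed in simulations or estimated offline.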

Regret Bounds

Different algorithms have different theoretical guarantees:

  • Epsilon-Greedy: O(T) regret with fixed ε; O(T^(2/3)) with a properly decayed ε
  • UCB: O(√(T log T)) regret
  • Thompson Sampling: O(√(T log T)) regret
  • LinUCB: Õ(d√T) regret, where d is the context dimension

Practical Considerations

1. Non-Stationary Environments

In real-world scenarios, arm rewards can change over time:

def adaptive_bandit():
    # Down-weight old observations so recent rewards dominate
    for arm in arms:
        arm.rewards *= decay_factor
        arm.pulls *= decay_factor
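
A related technique is a constant-step-size (exponentially weighted) estimate, which forgets old data without touching stored counts (a sketch with an assumed fixed step size):

```python
class DiscountedArm:
    """Tracks a reward estimate that forgets old observations."""

    def __init__(self, step_size=0.5):
        self.step_size = step_size  # fixed step size -> exponential forgetting
        self.estimate = 0.0

    def update(self, reward):
        # constant-step-size update: the estimate chases recent rewards
        self.estimate += self.step_size * (reward - self.estimate)

arm = DiscountedArm(step_size=0.5)
for r in [1, 1, 1, 1]:
    arm.update(r)
before_shift = arm.estimate   # close to 1 after a run of successes
for r in [0, 0, 0, 0]:
    arm.update(r)
print(before_shift, arm.estimate)  # estimate drops fast once the arm degrades
```

A sample-average estimate would take far longer to react, since each new observation is diluted by the entire history.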

2. Delayed Feedback

Sometimes feedback arrives with delay:

def delayed_feedback():
    # Store pending feedback keyed by user
    pending_feedback[user_id] = {
        'arm': selected_arm,
        'timestamp': current_time,
        'context': user_context
    }

# Process delayed feedback when it arrives
def process_feedback(user_id, reward):
    feedback = pending_feedback.pop(user_id)
    bandit.update(feedback['arm'], reward)

3. Multiple Objectives

Sometimes you want to optimize multiple metrics:

def multi_objective_bandit():
    # Weighted combination of objectives
    combined_reward = (
        w1 * conversion_rate +
        w2 * revenue +
        w3 * user_satisfaction
    )
    return combined_reward

Implementation Best Practices

1. Start Simple

Begin with epsilon-greedy and move to more sophisticated algorithms:

# Start with epsilon-greedy
bandit = EpsilonGreedyBandit(epsilon=0.1)

# Move to Thompson Sampling
bandit = ThompsonSamplingBandit()

# Add context if available
bandit = LinUCBBandit(n_features=10)

2. Monitor Performance

Track key metrics:

def monitor_bandit(bandit):
    metrics = {
        'cumulative_reward': bandit.total_reward,
        'average_reward': bandit.total_reward / bandit.total_pulls,
        'exploration_rate': bandit.exploration_pulls / bandit.total_pulls,
        'arm_distribution': bandit.arm_counts
    }
    return metrics

3. Handle Edge Cases

def robust_bandit_selection(bandit, context=None):
    try:
        if context is not None:
            return bandit.select_arm(context=context)
        return bandit.select_arm()
    except Exception:
        # Fall back to a random arm if selection fails
        return random.choice(bandit.arms)

4. A/B Testing Integration

def ab_test_with_bandit(test_name, variants):
    bandit = ThompsonSamplingBandit(n_arms=len(variants))

    def select_variant(user_id, context=None):
        arm_index = bandit.select_arm(context=context)
        return variants[arm_index], arm_index

    def update_feedback(arm_index, reward, context=None):
        bandit.update(arm_index, reward, context=context)

    return select_variant, update_feedback

Common Pitfalls

1. Insufficient Exploration

# Bad: Too little exploration
bandit = EpsilonGreedyBandit(epsilon=0.01) # Only 1% exploration

# Good: Balanced exploration
bandit = EpsilonGreedyBandit(epsilon=0.1) # 10% exploration

2. Ignoring Context

# Bad: Ignoring available context
def select_arm(user_context):
    return bandit.select_arm()  # Not using context

# Good: Using context
def select_arm(user_context):
    return bandit.select_arm(context=user_context)

3. Not Handling Non-Stationarity

# Bad: Not adapting to changes
bandit = ThompsonSamplingBandit()
# Never update priors or decay old data

# Good: Adapting to changes
bandit = ThompsonSamplingBandit()
bandit.decay_factor = 0.99 # Decay old observations

4. Poor Reward Design

# Bad: Binary rewards only
reward = 1 if user_clicked else 0

# Good: Rich reward signal capturing the whole funnel
reward = (
    0.1 * user_viewed +
    0.3 * user_clicked +
    0.7 * user_added_to_cart +
    1.0 * user_purchased
)

Advanced Topics

1. Hierarchical Bandits

For complex scenarios with multiple levels of decisions:

class HierarchicalBandit:
    def __init__(self):
        self.category_bandit = ThompsonSamplingBandit(n_arms=5)
        self.product_bandits = {
            category: ThompsonSamplingBandit(n_arms=10)
            for category in categories
        }

    def select_product(self, user_context):
        # First select a category
        category = self.category_bandit.select_arm(context=user_context)

        # Then select a product within that category
        product = self.product_bandits[category].select_arm(context=user_context)

        return category, product

2. Combinatorial Bandits

When you need to select multiple items simultaneously:

class CombinatorialBandit:
    def __init__(self, n_items, k_recommendations):
        self.n_items = n_items
        self.k_recommendations = k_recommendations
        self.item_bandits = [ThompsonSamplingBandit() for _ in range(n_items)]

    def select_items(self, user_context):
        # Sample a score for every item from its bandit's posterior
        # (select_arm would return an arm index, not a score, so a
        # separate sampling method is assumed here)
        scores = []
        for i, bandit in enumerate(self.item_bandits):
            score = bandit.sample_score(context=user_context)
            scores.append((i, score))

        # Recommend the k highest-scoring items
        top_items = sorted(scores, key=lambda x: x[1], reverse=True)[:self.k_recommendations]
        return [item[0] for item in top_items]

3. Federated Bandits

For distributed scenarios where data is spread across multiple sources:

class FederatedBandit:
    def __init__(self, n_clients):
        self.client_bandits = [ThompsonSamplingBandit() for _ in range(n_clients)]
        self.global_bandit = ThompsonSamplingBandit()

    def select_arm(self, client_id, context):
        # Use the local bandit for client-specific decisions
        local_arm = self.client_bandits[client_id].select_arm(context=context)

        # Also consult global information
        global_arm = self.global_bandit.select_arm(context=context)

        # Combine local and global choices (combination strategy left abstract)
        return self.combine_arms(local_arm, global_arm)

Conclusion

Bandit algorithms provide a principled approach to solving the exploration-exploitation trade-off in recommendation systems. By choosing the right algorithm and implementing it correctly, you can significantly improve the performance of your recommendation system while continuously learning and adapting to user preferences.

The key is to start simple, monitor performance, and gradually add complexity as needed. With proper implementation, bandit algorithms can help you build more intelligent, adaptive recommendation systems that deliver better results for both users and businesses.