LGMLFeb 15, 2018

Bandit Learning with Positive Externalities

arXiv:1802.05693v518 citations
AI Analysis

This addresses a key challenge in platforms with self-reinforcing user behavior, offering a novel solution with proven optimality, though it is incremental in the context of bandit algorithms.

The paper tackles the problem of multi-armed bandit learning with positive externalities, where user arrivals are self-reinforcing, and shows that standard algorithms like UCB can suffer linear regret. It introduces a new algorithm, Balanced Exploration (BE), and an adaptive variant, proving their asymptotic regret optimality.

In many platforms, user arrivals exhibit a self-reinforcing behavior: future user arrivals are likely to have preferences similar to users who were satisfied in the past. In other words, arrivals exhibit positive externalities. We study multiarmed bandit (MAB) problems with positive externalities. We show that the self-reinforcing preferences may lead standard benchmark algorithms such as UCB to exhibit linear regret. We develop a new algorithm, Balanced Exploration (BE), which explores arms carefully to avoid suboptimal convergence of arrivals before sufficient evidence is gathered. We also introduce an adaptive variant of BE which successively eliminates suboptimal arms. We analyze their asymptotic regret, and establish optimality by showing that no algorithm can perform better.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes