Bandit Learning with Positive Externalities
This addresses a key challenge in platforms with self-reinforcing user behavior, offering a novel solution with proven optimality, though it is incremental in the context of bandit algorithms.
The paper tackles the problem of multi-armed bandit learning with positive externalities, where user arrivals are self-reinforcing, and shows that standard algorithms like UCB can suffer linear regret. It introduces a new algorithm, Balanced Exploration (BE), and an adaptive variant, proving their asymptotic regret optimality.
In many platforms, user arrivals exhibit a self-reinforcing behavior: future user arrivals are likely to have preferences similar to users who were satisfied in the past. In other words, arrivals exhibit positive externalities. We study multiarmed bandit (MAB) problems with positive externalities. We show that the self-reinforcing preferences may lead standard benchmark algorithms such as UCB to exhibit linear regret. We develop a new algorithm, Balanced Exploration (BE), which explores arms carefully to avoid suboptimal convergence of arrivals before sufficient evidence is gathered. We also introduce an adaptive variant of BE which successively eliminates suboptimal arms. We analyze their asymptotic regret, and establish optimality by showing that no algorithm can perform better.