LG ML Jun 11, 2020

Maximal Objectives in the Multi-armed Bandit with Applications

arXiv:2006.06853v6 1.2

Originality: Incremental advance

AI Analysis

The paper addresses an operational concern of online platforms: grooming an adequate supply of market participants. The objective is novel, but the methodological contribution is incremental.

The paper tackles the problem of maximizing the highest total reward across arms in a multi-armed bandit setting, showing that any policy incurs an instance-dependent asymptotic regret of Ω(log T) and worst-case regret of Ω(K^{1/3}T^{2/3}), and designs an adaptive policy that achieves these bounds up to logarithmic factors.
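As a rough illustration, the two objectives can be written side by side. The notation here is assumed for this sketch and may differ from the paper's: $S_i(T)$ is the total reward collected from arm $i$ by the end of $T$ pulls, and $\mu^*$ is the largest mean reward among the $K$ arms.

$$
\text{sum objective: } \mathbb{E}\Big[\sum_{i=1}^{K} S_i(T)\Big],
\qquad
\text{max objective: } \mathbb{E}\Big[\max_{1 \le i \le K} S_i(T)\Big].
$$

Taking as a benchmark the policy that always pulls an arm with mean $\mu^*$, which earns $\mu^* T$ in expectation, one plausible way to express regret under the max objective is

$$
\mu^* T - \mathbb{E}\Big[\max_{1 \le i \le K} S_i(T)\Big].
$$

This benchmark is an assumption made for illustration; the paper's formal regret definition should be taken from the paper itself.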

In several applications of the stochastic multi-armed bandit problem, the traditional objective of maximizing the expected total reward can be inappropriate. In this paper, motivated by certain operational concerns in online platforms, we consider a new objective in the classical setup. Given $K$ arms, instead of maximizing the expected total reward from $T$ pulls (the traditional "sum" objective), we consider the vector of total rewards earned from each of the $K$ arms at the end of $T$ pulls and aim to maximize the expected highest total reward across arms (the "max" objective). For this objective, we show that any policy must incur an instance-dependent asymptotic regret of $Ω(\log T)$ (with a higher instance-dependent constant compared to the traditional objective) and a worst-case regret of $Ω(K^{1/3}T^{2/3})$. We then design an adaptive explore-then-commit policy featuring exploration based on appropriately tuned confidence bounds on the mean reward and an adaptive stopping criterion, which adapts to the problem difficulty and achieves these bounds (up to logarithmic factors). We then generalize our algorithmic insights to the problem of maximizing the expected value of the average total reward of the top $m$ arms with the highest total rewards. Our numerical experiments demonstrate the efficacy of our policies compared to several natural alternatives in practical parameter regimes. We discuss applications of these new objectives to the problem of grooming an adequate supply of value-providing market participants (workers/sellers/service providers) in online platforms.
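The difference between the "sum" and "max" objectives can be seen in a small simulation. The sketch below is not the paper's policy: it uses a plain explore-then-commit rule with a fixed exploration length, standing in for the adaptive, confidence-bound-based stopping criterion described above. The function name `explore_then_commit`, the Bernoulli reward model, and all parameter values are assumptions chosen for illustration.

```python
import random

def explore_then_commit(means, horizon, explore_per_arm, seed=0):
    """Fixed-length explore-then-commit on Bernoulli arms (illustrative only).

    This is NOT the adaptive policy from the paper; the exploration length
    is fixed in advance instead of being chosen by a stopping criterion.
    """
    rng = random.Random(seed)
    k = len(means)
    totals = [0] * k  # total reward earned from each arm
    pulls = [0] * k   # number of times each arm was pulled

    def pull(arm):
        reward = 1 if rng.random() < means[arm] else 0
        totals[arm] += reward
        pulls[arm] += 1

    # Exploration phase: round-robin over the arms.
    budget = min(horizon, k * explore_per_arm)
    for t in range(budget):
        pull(t % k)

    # Commit phase: pull the arm with the best empirical mean until the end.
    best = max(range(k), key=lambda a: totals[a] / max(pulls[a], 1))
    for _ in range(horizon - budget):
        pull(best)

    return totals, pulls

totals, pulls = explore_then_commit(
    [0.5, 0.6, 0.7], horizon=10_000, explore_per_arm=200
)
max_objective = max(totals)  # "max" objective: highest per-arm total reward
sum_objective = sum(totals)  # traditional "sum" objective
print("max objective:", max_objective)
print("sum objective:", sum_objective)
```

Under the "sum" objective, rewards collected while exploring the other arms still count. Under the "max" objective they do not contribute to the final value unless that arm ends up with the highest total, so exploration is more costly, which is consistent with the larger regret bounds stated above.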
