LG GTDec 14, 2024

p-Mean Regret for Stochastic Bandits

Anand Krishna, Philips George John, Adarsh Barik, Vincent Y. F. Tan

arXiv:2412.10751v111.55 citationsh-index: 5Has CodeAAAI

Originality Incremental advance

AI Analysis

This work addresses algorithm designers in stochastic bandits by offering a unified framework for regret analysis, though it is incremental as it builds on existing UCB methods and prior regret concepts.

The paper tackles the problem of evaluating bandit algorithms by extending the p-mean welfare objective to define p-mean regret, providing a flexible framework to balance fairness and efficiency. It introduces a UCB-based algorithm that achieves novel regret bounds, such as O~(sqrt(k/T^(1/(2|p|)))) for p ≤ -1, matching lower bounds for 0 < p ≤ 1.

In this work, we extend the concept of the $p$-mean welfare objective from social choice theory (Moulin 2004) to study $p$-mean regret in stochastic multi-armed bandit problems. The $p$-mean regret, defined as the difference between the optimal mean among the arms and the $p$-mean of the expected rewards, offers a flexible framework for evaluating bandit algorithms, enabling algorithm designers to balance fairness and efficiency by adjusting the parameter $p$. Our framework encompasses both average cumulative regret and Nash regret as special cases. We introduce a simple, unified UCB-based algorithm (Explore-Then-UCB) that achieves novel $p$-mean regret bounds. Our algorithm consists of two phases: a carefully calibrated uniform exploration phase to initialize sample means, followed by the UCB1 algorithm of Auer, Cesa-Bianchi, and Fischer (2002). Under mild assumptions, we prove that our algorithm achieves a $p$-mean regret bound of $\tilde{O}\left(\sqrt{\frac{k}{T^{\frac{1}{2|p|}}}}\right)$ for all $p \leq -1$, where $k$ represents the number of arms and $T$ the time horizon. When $-1<p<0$, we achieve a regret bound of $\tilde{O}\left(\sqrt{\frac{k^{1.5}}{T^{\frac{1}{2}}}}\right)$. For the range $0< p \leq 1$, we achieve a $p$-mean regret scaling as $\tilde{O}\left(\sqrt{\frac{k}{T}}\right)$, which matches the previously established lower bound up to logarithmic factors (Auer et al. 1995). This result stems from the fact that the $p$-mean regret of any algorithm is at least its average cumulative regret for $p \leq 1$. In the case of Nash regret (the limit as $p$ approaches zero), our unified approach differs from prior work (Barman et al. 2023), which requires a new Nash Confidence Bound algorithm. Notably, we achieve the same regret bound up to constant factors using our more general method.

View on arXiv PDF Code

Similar