LGMLJun 8, 2021

Scale Free Adversarial Multi Armed Bandits

arXiv:2106.04700v212 citations
AI Analysis

This addresses the problem of robust decision-making under uncertainty in adversarial environments for researchers in online learning and bandit algorithms, representing a significant advance with novel bounds.

The paper tackles the scale-free adversarial multi-armed bandits problem, where the player lacks prior knowledge of loss scales or rounds, and achieves regret bounds of $ ilde{\mathcal{O}}(\sqrt{nL_2} + L_\infty\sqrt{nT})$ with non-adaptive exploration and $ ilde{\mathcal{O}}(\sqrt{nL_2} + L_\infty\sqrt{nL_1})$ with adaptive exploration, which are the first to adapt to $\|\cdot\|_2$ and $\|\cdot\|_1$ norms of losses.

We consider the Scale-Free Adversarial Multi Armed Bandits(MAB) problem. At the beginning of the game, the player only knows the number of arms $n$. It does not know the scale and magnitude of the losses chosen by the adversary or the number of rounds $T$. In each round, it sees bandit feedback about the loss vectors $l_1,\dots, l_T \in \mathbb{R}^n$. The goal is to bound its regret as a function of $n$ and norms of $l_1,\dots, l_T$. We design a bandit Follow The Regularized Leader (FTRL) algorithm, that uses an adaptive learning rate and give two different regret bounds, based on the exploration parameter used. With non-adaptive exploration, our algorithm has a regret of $\tilde{\mathcal{O}}(\sqrt{nL_2} + L_\infty\sqrt{nT})$ and with adaptive exploration, it has a regret of $\tilde{\mathcal{O}}(\sqrt{nL_2} + L_\infty\sqrt{nL_1})$. Here $L_\infty = \sup_t \| l_t\|_\infty$, $L_2 = \sum_{t=1}^T \|l_t\|_2^2$, $L_1 = \sum_{t=1}^T \|l_t\|_1$ and the $\tilde{\mathcal{O}}$ notation suppress logarithmic factors. These are the first MAB bounds that adapt to the $\|\cdot\|_2$, $\|\cdot\|_1$ norms of the losses. The second bound is the first data-dependent scale-free MAB bound as $T$ does not directly appear in the regret. We also develop a new technique for obtaining a rich class of local-norm lower-bounds for Bregman Divergences. This technique plays a crucial role in our analysis for controlling the regret when using importance weighted estimators of unbounded losses. This technique could be of independent interest.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes