LGDSOCMLJan 31, 2022

Rotting Infinitely Many-armed Bandits

arXiv:2201.12975v36 citations
AI Analysis

This addresses a theoretical challenge in online learning with non-stationary rewards, providing algorithms for scenarios where arm quality decays, which is incremental as it extends prior bandit models.

The paper tackles the infinitely many-armed bandit problem with rotting rewards, where arm rewards decrease over time, and establishes a worst-case regret lower bound of Ω(max{ϱ^{1/3}T, √T}) and matching upper bounds with algorithms using UCB indices and thresholds.

We consider the infinitely many-armed bandit problem with rotting rewards, where the mean reward of an arm decreases at each pull of the arm according to an arbitrary trend with maximum rotting rate $\varrho=o(1)$. We show that this learning problem has an $Ω(\max\{\varrho^{1/3}T,\sqrt{T}\})$ worst-case regret lower bound where $T$ is the horizon time. We show that a matching upper bound $\tilde{O}(\max\{\varrho^{1/3}T,\sqrt{T}\})$, up to a poly-logarithmic factor, can be achieved by an algorithm that uses a UCB index for each arm and a threshold value to decide whether to continue pulling an arm or remove the arm from further consideration, when the algorithm knows the value of the maximum rotting rate $\varrho$. We also show that an $\tilde{O}(\max\{\varrho^{1/3}T,T^{3/4}\})$ regret upper bound can be achieved by an algorithm that does not know the value of $\varrho$, by using an adaptive UCB index along with an adaptive threshold value.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes