Adaptation to the Range in $K$-Armed Bandits
This work addresses a fundamental limitation in bandit algorithms for researchers and practitioners, but it is incremental as it builds on known regret bounds by introducing range adaptation.
The paper tackles the problem of stochastic bandit learning when the range of rewards is unknown, showing that this introduces a cost and a new trade-off between distribution-dependent and distribution-free regret bounds, preventing simultaneous achievement of typical logarithmic and square-root rates. They propose a strategy that achieves the regret rates indicated by this trade-off, such as requiring at least order √T distribution-dependent regret for a √T distribution-free bound.
We consider stochastic bandit problems with $K$ arms, each associated with a bounded distribution supported on the range $[m,M]$. We do not assume that the range $[m,M]$ is known and show that there is a cost for learning this range. Indeed, a new trade-off between distribution-dependent and distribution-free regret bounds arises, which prevents from simultaneously achieving the typical $\ln T$ and $\sqrt{T}$ bounds. For instance, a $\sqrt{T}$}distribution-free regret bound may only be achieved if the distribution-dependent regret bounds are at least of order $\sqrt{T}$. We exhibit a strategy achieving the rates for regret indicated by the new trade-off.