Bandits with Side Observations: Bounded vs. Logarithmic Regret
This work addresses a fundamental limitation in bandit algorithms for scenarios with infrequent extra observations, offering theoretical guarantees that could benefit applications in online learning and decision-making under uncertainty.
The paper tackles the problem of stochastic multi-armed bandits with occasional free side observations, proving that an agent can achieve a regret uniformly bounded in time regardless of the observation frequency. The result includes an algorithm with regret bounded by a sum of terms involving log(1/ε)/Δ_i, up to constants, and a matching lower bound.
We consider the classical stochastic multi-armed bandit but where, from time to time and roughly with frequency $ε$, an extra observation is gathered by the agent for free. We prove that, no matter how small $ε$ is the agent can ensure a regret uniformly bounded in time. More precisely, we construct an algorithm with a regret smaller than $\sum_i \frac{\log(1/ε)}{Δ_i}$, up to multiplicative constant and loglog terms. We also prove a matching lower-bound, stating that no reasonable algorithm can outperform this quantity.