Bounded Memory Adversarial Bandits with Composite Anonymous Delayed Feedback
This addresses a harder non-oblivious setting in bandit problems with delayed feedback, providing theoretical bounds for researchers in online learning and optimization.
The paper tackles the adversarial bandit problem with composite anonymous delayed feedback in a non-oblivious setting, showing it incurs Ω(T) pseudo regret but proposing a wrapper algorithm achieving O(T^{2/3}) policy regret for K-armed bandit and bandit convex optimization, with a matching lower bound.
We study the adversarial bandit problem with composite anonymous delayed feedback. In this setting, losses of an action are split into $d$ components, spreading over consecutive rounds after the action is chosen. And in each round, the algorithm observes the aggregation of losses that come from the latest $d$ rounds. Previous works focus on oblivious adversarial setting, while we investigate the harder non-oblivious setting. We show non-oblivious setting incurs $Ω(T)$ pseudo regret even when the loss sequence is bounded memory. However, we propose a wrapper algorithm which enjoys $o(T)$ policy regret on many adversarial bandit problems with the assumption that the loss sequence is bounded memory. Especially, for $K$-armed bandit and bandit convex optimization, we have $\mathcal{O}(T^{2/3})$ policy regret bound. We also prove a matching lower bound for $K$-armed bandit. Our lower bound works even when the loss sequence is oblivious but the delay is non-oblivious. It answers the open problem proposed in \cite{wang2021adaptive}, showing that non-oblivious delay is enough to incur $\tildeΩ(T^{2/3})$ regret.