Banker Online Mirror Descent: A Universal Approach for Delayed Online Bandit Learning
This work addresses delayed feedback challenges in online learning, offering a general solution with applications to bandit scenarios, though it is incremental in extending existing OMD techniques.
The paper tackles the problem of delayed feedback in online bandit learning by proposing Banker-OMD, a framework that decouples delay handling from algorithm design, achieving regret bounds of $\widetilde{\mathcal O}(\sqrt{T} + \sqrt{D})$ and leading to the first algorithms for delayed scale-free adversarial MABs and adversarial linear bandits with specific regret improvements.
We propose Banker Online Mirror Descent (Banker-OMD), a novel framework generalizing the classical Online Mirror Descent (OMD) technique in the online learning literature. The Banker-OMD framework almost completely decouples feedback delay handling and the task-specific OMD algorithm design, thus facilitating the design of new algorithms capable of efficiently and robustly handling feedback delays. Specifically, it offers a general methodology for achieving $\widetilde{\mathcal O}(\sqrt{T} + \sqrt{D})$-style regret bounds in online bandit learning tasks with delayed feedback, where $T$ is the number of rounds and $D$ is the total feedback delay. We demonstrate the power of \texttt{Banker-OMD} by applications to two important bandit learning scenarios with delayed feedback, including delayed scale-free adversarial Multi-Armed Bandits (MAB) and delayed adversarial linear bandits. \texttt{Banker-OMD} leads to the first delayed scale-free adversarial MAB algorithm achieving $\widetilde{\mathcal O}(\sqrt{K}L(\sqrt T+\sqrt D))$ regret and the first delayed adversarial linear bandit algorithm achieving $\widetilde{\mathcal O}(\text{poly}(n)(\sqrt{T} + \sqrt{D}))$ regret. As a corollary, the first application also implies $\widetilde{\mathcal O}(\sqrt{KT}L)$ regret for non-delayed scale-free adversarial MABs, which is the first to match the $Ω(\sqrt{KT}L)$ lower bound up to logarithmic factors and can be of independent interest.