LGMLJul 9, 2018

Bandit Online Learning with Unknown Delays

arXiv:1807.03205v348 citations
Originality Highly original
AI Analysis

This addresses a critical bottleneck in online learning for applications like recommendation systems or network optimization where feedback delays are unpredictable, offering a novel solution to an intractable problem.

The paper tackles bandit online learning problems with unknown feedback delays in multi-armed bandit and bandit convex optimization settings, developing DEXP3 and DBGD algorithms that achieve regret bounds of O(√(Kd̄(T+D))) and O(√(K(T+D))), respectively, with performance validated on synthetic and real data.

This paper deals with bandit online learning problems involving feedback of unknown delay that can emerge in multi-armed bandit (MAB) and bandit convex optimization (BCO) settings. MAB and BCO require only values of the objective function involved that become available through feedback, and are used to estimate the gradient appearing in the corresponding iterative algorithms. Since the challenging case of feedback with \emph{unknown} delays prevents one from constructing the sought gradient estimates, existing MAB and BCO algorithms become intractable. For such challenging setups, delayed exploration, exploitation, and exponential (DEXP3) iterations, along with delayed bandit gradient descent (DBGD) iterations are developed for MAB and BCO, respectively. Leveraging a unified analysis framework, it is established that the regret of DEXP3 and DBGD are ${\cal O}\big( \sqrt{K\bar{d}(T+D)} \big)$ and ${\cal O}\big( \sqrt{K(T+D)} \big)$, respectively, where $\bar{d}$ is the maximum delay and $D$ denotes the delay accumulated over $T$ slots. Numerical tests using both synthetic and real data validate the performance of DEXP3 and DBGD.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes