LG, Jan 9, 2025

No-Regret Linear Bandits under Gap-Adjusted Misspecification

Chong Liu, Dan Qiao, Ming Yin, Ilija Bogunovic, Yu-Xiang Wang

Princeton

arXiv:2501.05361v14.11 citations

h-index: 24

Originality: Highly original

AI Analysis

This work addresses the unavoidable linear regret that arises in linear bandits when the reward function is not perfectly linear, and offers a misspecification model that is more practical for real-world optimization problems.

The paper tackles linear bandits with misspecification by introducing a gap-adjusted model that tolerates larger errors in suboptimal regions. It shows that LinUCB achieves near-optimal O(√T) regret under this model, and it proposes a new phased elimination-based algorithm that attains O(√T) regret and is deployment-efficient.

This work studies linear bandits under a new notion of gap-adjusted misspecification and is an extension of Liu et al. (2023). When the underlying reward function is not linear, existing work on linear bandits usually relies on a uniform misspecification parameter $ε$ that measures the sup-norm error of the best linear approximation. This results in an unavoidable linear regret whenever $ε > 0$. We propose a more natural model of misspecification which only requires the approximation error at each input $x$ to be proportional to the suboptimality gap at $x$. It captures the intuition that, for optimization problems, near-optimal regions should matter more and we can tolerate larger approximation errors in suboptimal regions. Quite surprisingly, we show that the classical LinUCB algorithm, designed for the realizable case, is automatically robust against such $ρ$-gap-adjusted misspecification with parameter $ρ$ diminishing at $O(1/(d \sqrt{\log T}))$. It achieves a near-optimal $O(\sqrt{T})$ regret for problems in which the best-known regret is almost linear in the time horizon $T$. We further advance this frontier by presenting a novel phased elimination-based algorithm whose gap-adjusted misspecification parameter $ρ = O(1/\sqrt{d})$ does not scale with $T$. This algorithm attains optimal $O(\sqrt{T})$ regret and is deployment-efficient, requiring only $\log T$ batches of exploration. It also enjoys an adaptive $O(\log T)$ regret when a constant suboptimality gap exists. Technically, our proof relies on a novel self-bounding argument that bounds the part of the regret due to misspecification by the regret itself, and a new inductive lemma that limits the misspecification error within the suboptimality gap for all valid actions in each batch selected by G-optimal design.
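To make the contrast between the two misspecification notions concrete, here is a minimal sketch on a finite action set. It assumes one reading of the abstract: the gap-adjusted parameter is the smallest $ρ$ such that the approximation error at every action $x$ is at most $ρ$ times the suboptimality gap at $x$, while the uniform parameter $ε$ is the largest error over all actions. The function name `misspecification_levels`, the tolerance `tol`, and the toy rewards are hypothetical and do not come from the paper.

```python
import math


def misspecification_levels(actions, rewards, theta, tol=1e-12):
    """Return (epsilon, rho) for a linear fit `theta` on a finite action set.

    epsilon: uniform (sup-norm) error, max over x of |f(x) - <theta, x>|.
    rho:     smallest value with |f(x) - <theta, x>| <= rho * (f* - f(x))
             for every action x (assumed reading of "gap-adjusted").
             It is infinite if the fit is wrong at an optimal action.
    """
    predictions = [sum(t * a for t, a in zip(theta, x)) for x in actions]
    errors = [abs(r - p) for r, p in zip(rewards, predictions)]
    best = max(rewards)
    gaps = [best - r for r in rewards]

    epsilon = max(errors)
    rho = 0.0
    for err, gap in zip(errors, gaps):
        if gap <= tol:
            # Optimal action: zero gap leaves no room for any error.
            if err > tol:
                return epsilon, math.inf
            continue
        rho = max(rho, err / gap)
    return epsilon, rho


# Toy 1-D problem: the linear model is exact at the optimum and
# increasingly wrong as actions become more suboptimal.
actions = [(0.0,), (0.5,), (1.0,)]
rewards = [0.2, 0.6, 1.0]   # hypothetical true rewards f(x)
theta = (1.0,)              # hypothetical linear approximation

epsilon, rho = misspecification_levels(actions, rewards, theta)
print(epsilon, rho)
```

In this toy problem the uniform error is 0.2, so a uniform analysis treats the problem as misspecified everywhere, whereas the gap-adjusted parameter is about 0.25 because the errors vanish as the actions approach the optimum.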
