LGFeb 26, 2023

No-Regret Linear Bandits beyond Realizability

Princeton
arXiv:2302.13252v23 citationsh-index: 13
Originality Highly original
AI Analysis

This addresses the issue of unavoidable linear regret in bandit optimization for non-linear rewards, offering a more robust model for practitioners in sequential decision-making.

The paper tackles the problem of linear bandits with non-linear reward functions by proposing a gap-adjusted misspecification model, showing that the LinUCB algorithm achieves near-optimal sqrt(T) regret, improving from almost linear regret in prior work.

We study linear bandits when the underlying reward function is not linear. Existing work relies on a uniform misspecification parameter $ε$ that measures the sup-norm error of the best linear approximation. This results in an unavoidable linear regret whenever $ε> 0$. We describe a more natural model of misspecification which only requires the approximation error at each input $x$ to be proportional to the suboptimality gap at $x$. It captures the intuition that, for optimization problems, near-optimal regions should matter more and we can tolerate larger approximation errors in suboptimal regions. Quite surprisingly, we show that the classical LinUCB algorithm -- designed for the realizable case -- is automatically robust against such gap-adjusted misspecification. It achieves a near-optimal $\sqrt{T}$ regret for problems that the best-known regret is almost linear in time horizon $T$. Technically, our proof relies on a novel self-bounding argument that bounds the part of the regret due to misspecification by the regret itself.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes