LGFeb 26, 2023

No-Regret Linear Bandits beyond Realizability

Princeton

arXiv:2302.13252v25.33 citationsh-index: 13

Originality Highly original

AI Analysis

This addresses the issue of unavoidable linear regret in bandit optimization for non-linear rewards, offering a more robust model for practitioners in sequential decision-making.

The paper tackles the problem of linear bandits with non-linear reward functions by proposing a gap-adjusted misspecification model, showing that the LinUCB algorithm achieves near-optimal sqrt(T) regret, improving from almost linear regret in prior work.

We study linear bandits when the underlying reward function is not linear. Existing work relies on a uniform misspecification parameter $ε$ that measures the sup-norm error of the best linear approximation. This results in an unavoidable linear regret whenever $ε> 0$. We describe a more natural model of misspecification which only requires the approximation error at each input $x$ to be proportional to the suboptimality gap at $x$. It captures the intuition that, for optimization problems, near-optimal regions should matter more and we can tolerate larger approximation errors in suboptimal regions. Quite surprisingly, we show that the classical LinUCB algorithm -- designed for the realizable case -- is automatically robust against such gap-adjusted misspecification. It achieves a near-optimal $\sqrt{T}$ regret for problems that the best-known regret is almost linear in time horizon $T$. Technically, our proof relies on a novel self-bounding argument that bounds the part of the regret due to misspecification by the regret itself.

View on arXiv PDF

Similar