LGAug 16, 2025

An Improved Algorithm for Adversarial Linear Contextual Bandits via Reduction

Tim van Erven, Jack Mayo, Julia Olkhovskaya, Chen-Yu Wei

arXiv:2508.11931v17.11 citationsh-index: 20

Originality Highly original

AI Analysis

This work addresses the problem of efficient algorithms for linear contextual bandits with adversarial losses, which is significant for researchers and practitioners dealing with online decision-making problems with limited feedback.

The authors tackled the problem of linear contextual bandits with adversarial losses and stochastic action sets, achieving a regret of $ ilde{O}(min{d^2sqrt{T}, sqrt{d^3Tlog K}})$ without knowledge of the context distribution. This resolves an open question on obtaining $ ext{poly}(d)sqrt{T}$ regret in polynomial time independent of the number of actions.

We present an efficient algorithm for linear contextual bandits with adversarial losses and stochastic action sets. Our approach reduces this setting to misspecification-robust adversarial linear bandits with fixed action sets. Without knowledge of the context distribution or access to a context simulator, the algorithm achieves $\tilde{O}(\min\{d^2\sqrt{T}, \sqrt{d^3T\log K}\})$ regret and runs in $\text{poly}(d,C,T)$ time, where $d$ is the feature dimension, $C$ is an upper bound on the number of linear constraints defining the action set in each round, $K$ is an upper bound on the number of actions in each round, and $T$ is number of rounds. This resolves the open question by Liu et al. (2023) on whether one can obtain $\text{poly}(d)\sqrt{T}$ regret in polynomial time independent of the number of actions. For the important class of combinatorial bandits with adversarial losses and stochastic action sets where the action sets can be described by a polynomial number of linear constraints, our algorithm is the first to achieve $\text{poly}(d)\sqrt{T}$ regret in polynomial time, while no prior algorithm achieves even $o(T)$ regret in polynomial time to our knowledge. When a simulator is available, the regret bound can be improved to $\tilde{O}(d\sqrt{L^\star})$, where $L^\star$ is the cumulative loss of the best policy.

View on arXiv PDF

Similar