Jack Mayo

LGAug 16, 2025

An Improved Algorithm for Adversarial Linear Contextual Bandits via Reduction

Tim van Erven, Jack Mayo, Julia Olkhovskaya et al.

We present an efficient algorithm for linear contextual bandits with adversarial losses and stochastic action sets. Our approach reduces this setting to misspecification-robust adversarial linear bandits with fixed action sets. Without knowledge of the context distribution or access to a context simulator, the algorithm achieves $\tilde{O}(\min\{d^2\sqrt{T}, \sqrt{d^3T\log K}\})$ regret and runs in $\text{poly}(d,C,T)$ time, where $d$ is the feature dimension, $C$ is an upper bound on the number of linear constraints defining the action set in each round, $K$ is an upper bound on the number of actions in each round, and $T$ is number of rounds. This resolves the open question by Liu et al. (2023) on whether one can obtain $\text{poly}(d)\sqrt{T}$ regret in polynomial time independent of the number of actions. For the important class of combinatorial bandits with adversarial losses and stochastic action sets where the action sets can be described by a polynomial number of linear constraints, our algorithm is the first to achieve $\text{poly}(d)\sqrt{T}$ regret in polynomial time, while no prior algorithm achieves even $o(T)$ regret in polynomial time to our knowledge. When a simulator is available, the regret bound can be improved to $\tilde{O}(d\sqrt{L^\star})$, where $L^\star$ is the cumulative loss of the best policy.

LGMay 1, 2023

First- and Second-Order Bounds for Adversarial Linear Contextual Bandits

Julia Olkhovskaya, Jack Mayo, Tim van Erven et al.

We consider the adversarial linear contextual bandit setting, which allows for the loss functions associated with each of $K$ arms to change over time without restriction. Assuming the $d$-dimensional contexts are drawn from a fixed known distribution, the worst-case expected regret over the course of $T$ rounds is known to scale as $\tilde O(\sqrt{Kd T})$. Under the additional assumption that the density of the contexts is log-concave, we obtain a second-order bound of order $\tilde O(K\sqrt{d V_T})$ in terms of the cumulative second moment of the learner's losses $V_T$, and a closely related first-order bound of order $\tilde O(K\sqrt{d L_T^*})$ in terms of the cumulative loss of the best policy $L_T^*$. Since $V_T$ or $L_T^*$ may be significantly smaller than $T$, these improve over the worst-case regret whenever the environment is relatively benign. Our results are obtained using a truncated version of the continuous exponential weights algorithm over the probability simplex, which we analyse by exploiting a novel connection to the linear bandit setting without contexts.

Jack Mayo

2 Papers