An Improved Relaxation for Oracle-Efficient Adversarial Contextual Bandits
This work addresses the challenge of efficient decision-making in adversarial environments for machine learning practitioners, representing an incremental improvement in regret bounds.
The paper tackles the adversarial contextual bandits problem by proposing an oracle-efficient relaxation algorithm, achieving a regret bound of O(T^{2/3}(K log(|Π|))^{1/3}) with O(K) oracle calls per round, improving upon prior bounds and matching a previous stochastic case result.
We present an oracle-efficient relaxation for the adversarial contextual bandits problem, where the contexts are sequentially drawn i.i.d from a known distribution and the cost sequence is chosen by an online adversary. Our algorithm has a regret bound of $O(T^{\frac{2}{3}}(K\log(|Π|))^{\frac{1}{3}})$ and makes at most $O(K)$ calls per round to an offline optimization oracle, where $K$ denotes the number of actions, $T$ denotes the number of rounds and $Π$ denotes the set of policies. This is the first result to improve the prior best bound of $O((TK)^{\frac{2}{3}}(\log(|Π|))^{\frac{1}{3}})$ as obtained by Syrgkanis et al. at NeurIPS 2016, and the first to match the original bound of Langford and Zhang at NeurIPS 2007 which was obtained for the stochastic case.