LGMLMay 1, 2023

First- and Second-Order Bounds for Adversarial Linear Contextual Bandits

arXiv:2305.00832v317 citations
Originality Incremental advance
AI Analysis

This work provides theoretical improvements for online learning in adversarial environments with contextual information, though it is incremental as it builds on existing worst-case bounds by adding distributional assumptions.

The paper tackles the adversarial linear contextual bandit problem, where loss functions change arbitrarily over time, and derives improved regret bounds under a log-concave context distribution assumption, achieving a second-order bound of order ˜O(K√dV_T) and a first-order bound of order ˜O(K√dL_T^*), which can be significantly better than the worst-case ˜O(√KdT) in benign environments.

We consider the adversarial linear contextual bandit setting, which allows for the loss functions associated with each of $K$ arms to change over time without restriction. Assuming the $d$-dimensional contexts are drawn from a fixed known distribution, the worst-case expected regret over the course of $T$ rounds is known to scale as $\tilde O(\sqrt{Kd T})$. Under the additional assumption that the density of the contexts is log-concave, we obtain a second-order bound of order $\tilde O(K\sqrt{d V_T})$ in terms of the cumulative second moment of the learner's losses $V_T$, and a closely related first-order bound of order $\tilde O(K\sqrt{d L_T^*})$ in terms of the cumulative loss of the best policy $L_T^*$. Since $V_T$ or $L_T^*$ may be significantly smaller than $T$, these improve over the worst-case regret whenever the environment is relatively benign. Our results are obtained using a truncated version of the continuous exponential weights algorithm over the probability simplex, which we analyse by exploiting a novel connection to the linear bandit setting without contexts.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes