LG · Mar 26

Optimal High-Probability Regret for Online Convex Optimization with Two-Point Bandit Feedback

arXiv:2603.25029 · 29.8 · h-index: 15

AI Analysis

This work closes a known gap in the regret analysis of online learning: tight high-probability bounds for strongly convex losses under two-point bandit feedback.

The paper tackles the open problem of achieving tight high-probability regret bounds for Online Convex Optimization with two-point bandit feedback in adversarial settings, and resolves it by providing a minimax optimal regret bound of O(d(log T + log(1/δ))/μ) for μ-strongly convex losses.

We consider the problem of Online Convex Optimization (OCO) with two-point bandit feedback in an adversarial environment. In this setting, a player attempts to minimize a sequence of adversarially generated convex loss functions, while only observing the value of each function at two points. While it is well known that two-point feedback allows for gradient estimation, achieving tight high-probability regret bounds for strongly convex functions has remained open, as highlighted by \citet{agarwal2010optimal}. The primary challenge lies in the heavy-tailed nature of bandit gradient estimators, which makes standard concentration analysis difficult. In this paper, we resolve this open challenge by providing the first high-probability regret bound of $O(d(\log T + \log(1/\delta))/\mu)$ for $\mu$-strongly convex losses. Our result is minimax optimal with respect to both the time horizon $T$ and the dimension $d$.
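To make the phrase "two-point feedback allows for gradient estimation" concrete, here is a minimal sketch of the classical two-point estimator from the bandit optimization literature: query the loss at $x + ru$ and $x - ru$ for a random unit direction $u$, then scale the difference by $d/(2r)$. This is an illustration under stated assumptions, not necessarily the exact construction analyzed in the paper. The names `two_point_gradient`, `radius`, and `quadratic_loss` are hypothetical, and `radius` is the perturbation size, which is unrelated to the confidence parameter $\delta$ in the regret bound.

```python
import math
import random


def sample_unit_sphere(d, rng):
    """Draw a direction uniformly from the unit sphere in R^d."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def two_point_gradient(f, x, radius, rng):
    """Estimate the gradient of f at x from two function evaluations.

    Assumption: f can be evaluated at both x + radius*u and x - radius*u.
    """
    d = len(x)
    u = sample_unit_sphere(d, rng)
    x_plus = [xi + radius * ui for xi, ui in zip(x, u)]
    x_minus = [xi - radius * ui for xi, ui in zip(x, u)]
    scale = d * (f(x_plus) - f(x_minus)) / (2.0 * radius)
    return [scale * ui for ui in u]


def average_estimate(f, x, radius, n_samples, seed=0):
    """Average many independent estimates (for illustration only)."""
    rng = random.Random(seed)
    total = [0.0] * len(x)
    for _ in range(n_samples):
        g = two_point_gradient(f, x, radius, rng)
        total = [t + gi for t, gi in zip(total, g)]
    return [t / n_samples for t in total]


# Hypothetical mu-strongly convex loss: f(x) = (mu / 2) * ||x||^2,
# whose true gradient at x is mu * x.
MU = 1.0


def quadratic_loss(x):
    return 0.5 * MU * sum(c * c for c in x)


x0 = [0.6, 0.0, -0.8, 0.0, 0.0]
g_avg = average_estimate(quadratic_loss, x0, radius=0.1, n_samples=20000)
true_grad = [MU * c for c in x0]
```

For this quadratic the estimator is unbiased, so `g_avg` should land close to `true_grad`. A single estimate can still be far from the true gradient, and in the online setting each round's loss is queried only once at its two points, so averaging as above is not available to the player. This per-round noise is what high-probability regret analysis has to control.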
