LG MLFeb 13, 2024

Exploration by Optimization with Hybrid Regularizers: Logarithmic Regret with Adversarial Robustness in Partial Monitoring

arXiv:2402.08321v26.43 citationsh-index: 16ICML

Originality Incremental advance

AI Analysis

This work addresses a key challenge in online learning for researchers and practitioners by improving regret bounds in partial monitoring, though it is incremental as it builds on existing exploration by optimization methods.

The paper tackles the problem of achieving optimal regret bounds in both stochastic and adversarial environments for partial monitoring, a framework of online decision-making with limited feedback, by introducing a new exploration by optimization framework with hybrid regularizers, resulting in a stochastic regret bound that is roughly Θ(k² log T) times smaller than existing best-of-both-worlds algorithms and providing the first O(log T) stochastic bound for globally observable games.

Partial monitoring is a generic framework of online decision-making problems with limited feedback. To make decisions from such limited feedback, it is necessary to find an appropriate distribution for exploration. Recently, a powerful approach for this purpose, \emph{exploration by optimization} (ExO), was proposed, which achieves optimal bounds in adversarial environments with follow-the-regularized-leader for a wide range of online decision-making problems. However, a naive application of ExO in stochastic environments significantly degrades regret bounds. To resolve this issue in locally observable games, we first establish a new framework and analysis for ExO with a hybrid regularizer. This development allows us to significantly improve existing regret bounds of best-of-both-worlds (BOBW) algorithms, which achieves nearly optimal bounds both in stochastic and adversarial environments. In particular, we derive a stochastic regret bound of $O(\sum_{a \neq a^*} k^2 m^2 \log T / Δ_a)$, where $k$, $m$, and $T$ are the numbers of actions, observations and rounds, $a^*$ is an optimal action, and $Δ_a$ is the suboptimality gap for action $a$. This bound is roughly $Θ(k^2 \log T)$ times smaller than existing BOBW bounds. In addition, for globally observable games, we provide a new BOBW algorithm with the first $O(\log T)$ stochastic bound.

View on arXiv PDF

Similar