LG MLFeb 26, 2025

Bandit and Delayed Feedback in Online Structured Prediction

Yuki Shibukawa, Taira Tsuchiya, Shinsaku Sakaue, Kenji Yamanishi

arXiv:2502.18709v27.11 citationsh-index: 10

Originality Incremental advance

AI Analysis

This addresses the challenge of unrealistic full-information feedback in online structured prediction, offering practical solutions for scenarios with complex outputs, though it is incremental relative to prior work on surrogate regret bounds.

The paper tackles the problem of online structured prediction with limited feedback, proposing algorithms for bandit and delayed feedback settings. For bandit feedback, they achieve a surrogate regret bound of O(√(KT)) and improve it to O(T^{2/3}) independent of output size K, with numerical comparisons provided.

Online structured prediction is a task of sequentially predicting outputs with complex structures based on inputs and past observations, encompassing online classification. Recent studies showed that in the full-information setting, we can achieve finite bounds on the \textit{surrogate regret}, i.e. the extra target loss relative to the best possible surrogate loss. In practice, however, full-information feedback is often unrealistic as it requires immediate access to the whole structure of complex outputs. Motivated by this, we propose algorithms that work with less demanding feedback, bandit and delayed feedback. For bandit feedback, by using a standard inverse-weighted gradient estimator, we achieve a surrogate regret bound of $O(\sqrt{KT})$ for the time horizon $T$ and the size of the output set $K$. However, $K$ can be extremely large when outputs are highly complex, resulting in an undesirable bound. To address this issue, we propose another algorithm that achieves a surrogate regret bound of $O(T^{2/3})$, which is independent of $K$. This is achieved with a carefully designed pseudo-inverse matrix estimator. Furthermore, we numerically compare the performance of these algorithms, as well as existing ones. Regarding delayed feedback, we provide algorithms and regret analyses that cover various scenarios, including full-information and bandit feedback, as well as fixed and variable delays.

View on arXiv PDF

Similar