LGMay 15, 2023

A Unified Analysis of Nonstochastic Delayed Feedback for Combinatorial Semi-Bandits, Linear Bandits, and MDPs

Dirk van der Hoeven, Lukas Zierahn, Tal Lancewicki, Aviv Rosenberg, Nicoló Cesa-Bianchi

arXiv:2305.08629v114.314 citations

Originality Incremental advance

AI Analysis

This work addresses delayed feedback challenges in bandit learning, offering theoretical improvements for researchers and practitioners in reinforcement learning and online optimization, though it builds incrementally on existing FTRL methods.

The paper tackles the problem of online learning with delayed bandit feedback by analyzing Follow The Regularized Leader (FTRL), deriving optimal regret bounds for combinatorial semi-bandits and adversarial Markov decision processes with delay, and providing an efficient near-optimal algorithm for linear bandits with delay.

We derive a new analysis of Follow The Regularized Leader (FTRL) for online learning with delayed bandit feedback. By separating the cost of delayed feedback from that of bandit feedback, our analysis allows us to obtain new results in three important settings. On the one hand, we derive the first optimal (up to logarithmic factors) regret bounds for combinatorial semi-bandits with delay and adversarial Markov decision processes with delay (and known transition functions). On the other hand, we use our analysis to derive an efficient algorithm for linear bandits with delay achieving near-optimal regret bounds. Our novel regret decomposition shows that FTRL remains stable across multiple rounds under mild assumptions on the Hessian of the regularizer.

View on arXiv PDF

Similar