LGJan 31, 2022

Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback

arXiv:2201.13172v226 citations
AI Analysis

This addresses the challenge of delayed feedback in reinforcement learning for agents in adversarial environments, representing a significant theoretical advance rather than an incremental improvement.

The paper tackles the problem of online learning in adversarial Markov decision processes with delayed bandit feedback, achieving near-optimal regret of sqrt(K + D), which improves upon the previous best bound of (K + D)^{2/3}.

The standard assumption in reinforcement learning (RL) is that agents observe feedback for their actions immediately. However, in practice feedback is often observed in delay. This paper studies online learning in episodic Markov decision process (MDP) with unknown transitions, adversarially changing costs, and unrestricted delayed bandit feedback. More precisely, the feedback for the agent in episode $k$ is revealed only in the end of episode $k + d^k$, where the delay $d^k$ can be changing over episodes and chosen by an oblivious adversary. We present the first algorithms that achieve near-optimal $\sqrt{K + D}$ regret, where $K$ is the number of episodes and $D = \sum_{k=1}^K d^k$ is the total delay, significantly improving upon the best known regret bound of $(K + D)^{2/3}$.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes