LG MLJun 10, 2020

Simultaneously Learning Stochastic and Adversarial Episodic MDPs with Known Transition

arXiv:2006.05606v219.761 citations

Originality Incremental advance

AI Analysis

This provides a robust solution for reinforcement learning in mixed stochastic-adversarial environments, though it is incremental as it builds on existing methods for multi-armed bandits.

The paper tackles the problem of learning episodic Markov Decision Processes with known transition and bandit feedback by developing an algorithm that achieves O(log T) regret in stochastic settings and O(sqrt(T)) regret in adversarial settings, with a general O(sqrt(C)) regret for corrupted losses.

This work studies the problem of learning episodic Markov Decision Processes with known transition and bandit feedback. We develop the first algorithm with a ``best-of-both-worlds'' guarantee: it achieves $\mathcal{O}(log T)$ regret when the losses are stochastic, and simultaneously enjoys worst-case robustness with $\tilde{\mathcal{O}}(\sqrt{T})$ regret even when the losses are adversarial, where $T$ is the number of episodes. More generally, it achieves $\tilde{\mathcal{O}}(\sqrt{C})$ regret in an intermediate setting where the losses are corrupted by a total amount of $C$. Our algorithm is based on the Follow-the-Regularized-Leader method from Zimin and Neu (2013), with a novel hybrid regularizer inspired by recent works of Zimmert et al. (2019a, 2019b) for the special case of multi-armed bandits. Crucially, our regularizer admits a non-diagonal Hessian with a highly complicated inverse. Analyzing such a regularizer and deriving a particular self-bounding regret guarantee is our key technical contribution and might be of independent interest.

View on arXiv PDF

Similar