LGAIMLFeb 18, 2023

Best of Both Worlds Policy Optimization

arXiv:2302.09408v117 citationsh-index: 26
Originality Incremental advance
AI Analysis

This work addresses the need for more adaptive regret bounds in reinforcement learning, offering a solution that benefits practitioners by improving performance in stochastic environments without compromising worst-case robustness, though it is incremental in refining existing theoretical foundations.

The paper tackles the problem of overly pessimistic worst-case regret bounds in policy optimization for reinforcement learning by designing a method that achieves polylog(T) regret for stochastic losses while maintaining worst-case guarantees for adversarial losses, specifically showing a gap-dependent polylog(T) regret bound for the first time in policy optimization.

Policy optimization methods are popular reinforcement learning algorithms in practice. Recent works have built theoretical foundation for them by proving $\sqrt{T}$ regret bounds even when the losses are adversarial. Such bounds are tight in the worst case but often overly pessimistic. In this work, we show that in tabular Markov decision processes (MDPs), by properly designing the regularizer, the exploration bonus and the learning rates, one can achieve a more favorable polylog$(T)$ regret when the losses are stochastic, without sacrificing the worst-case guarantee in the adversarial regime. To our knowledge, this is also the first time a gap-dependent polylog$(T)$ regret bound is shown for policy optimization. Specifically, we achieve this by leveraging a Tsallis entropy or a Shannon entropy regularizer in the policy update. Then we show that under known transitions, we can further obtain a first-order regret bound in the adversarial regime by leveraging the log-barrier regularizer.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes