LG · AI · OC · Jul 16, 2021

Refined Policy Improvement Bounds for MDPs

arXiv:2107.08068v1 · 2 citations
AI Analysis

This work addresses a theoretical limitation of reinforcement learning algorithms that is relevant to both researchers and practitioners. Its contribution is incremental, since it builds directly on prior results.

The paper tackles the problem that policy improvement bounds in Markov Decision Processes (MDPs) become degenerate as the discount factor approaches one, which calls into question the applicability of algorithms such as TRPO in that regime. The authors refine the existing bounds and propose a novel bound that is continuous in the discount factor and also applies to MDPs with long-run average rewards.

The policy improvement bound on the difference of the discounted returns plays a crucial role in the theoretical justification of the trust-region policy optimization (TRPO) algorithm. The existing bound becomes degenerate when the discount factor approaches one, making the applicability of TRPO and related algorithms questionable when the discount factor is close to one. We refine the results of Schulman et al. (2015) and Achiam et al. (2017) and propose a novel bound that is "continuous" in the discount factor. In particular, our bound is also applicable to MDPs with long-run average rewards.
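
For orientation, the degeneracy can be seen in the form of the existing bound. What follows is a sketch of that bound as it is commonly stated following Achiam et al. (2017), not the refined bound proposed in this paper. The notation ($J$, $d^{\pi}$, $A^{\pi}$, $\epsilon^{\pi'}$, $D_{TV}$) is an assumption introduced here for illustration and does not appear in the abstract; exact constants should be checked against the cited papers.

```latex
% Assumed notation (not from the abstract):
%   J(\pi)       expected discounted return of policy \pi
%   d^{\pi}      normalized discounted state-visitation distribution of \pi
%   A^{\pi}      advantage function of \pi
%   D_{TV}       total variation distance between action distributions at state s
J(\pi') - J(\pi) \;\ge\; \frac{1}{1-\gamma}\,
  \mathbb{E}_{s \sim d^{\pi},\, a \sim \pi'}
  \left[ A^{\pi}(s,a) \;-\; \frac{2\gamma\,\epsilon^{\pi'}}{1-\gamma}\,
         D_{TV}\!\left(\pi' \,\|\, \pi\right)[s] \right],
\qquad
\epsilon^{\pi'} = \max_{s} \left| \mathbb{E}_{a \sim \pi'}\!\left[ A^{\pi}(s,a) \right] \right| .
```

Read at face value, the coefficient $2\gamma/(1-\gamma)$ on the divergence penalty grows without bound as $\gamma \to 1$, so for any fixed nonzero policy change the lower bound becomes uninformative. This appears to be the degeneracy the abstract refers to.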
