Refined Policy Improvement Bounds for MDPs
This work addresses a theoretical limitation of reinforcement learning algorithms that is relevant to both researchers and practitioners, although the contribution is incremental because it builds directly on prior results.
The paper tackles the problem that policy improvement bounds in Markov Decision Processes (MDPs) become degenerate as the discount factor approaches one, which calls into question the applicability of algorithms such as TRPO. The authors refine existing bounds and propose a novel bound that is continuous in the discount factor and also applies to MDPs with long-run average rewards.
The policy improvement bound on the difference of the discounted returns plays a crucial role in the theoretical justification of the trust-region policy optimization (TRPO) algorithm. The existing bound becomes degenerate as the discount factor approaches one, which makes the applicability of TRPO and related algorithms questionable in that regime. We refine the results in \cite{Schulman2015, Achiam2017} and propose a novel bound that is "continuous" in the discount factor. In particular, our bound is also applicable to MDPs with long-run average rewards.
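To make the degeneracy concrete, the following minimal sketch evaluates the penalty coefficient C = 4 * eps * gamma / (1 - gamma)^2 that multiplies the maximum KL divergence in the bound of \cite{Schulman2015}, where eps is the maximum absolute advantage. The function names are hypothetical, and eps is held fixed at 1 purely for illustration (in general it depends on the policy and the discount factor). The second function rescales the coefficient by (1 - gamma), the normalization under which the discounted return is comparable to a long-run average reward; under these assumptions that rescaled coefficient still grows without bound.

```python
def trpo_penalty_coefficient(gamma: float, eps: float = 1.0) -> float:
    """Coefficient C = 4 * eps * gamma / (1 - gamma)^2 on the max-KL term.

    eps stands for the maximum absolute advantage; fixing it at 1.0 is an
    illustrative assumption, not a value taken from the paper.
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    return 4.0 * eps * gamma / (1.0 - gamma) ** 2


def normalized_penalty_coefficient(gamma: float, eps: float = 1.0) -> float:
    """Same coefficient after rescaling the return by (1 - gamma).

    The rescaled value equals 4 * eps * gamma / (1 - gamma), which still
    diverges as gamma approaches one when eps is held fixed.
    """
    return (1.0 - gamma) * trpo_penalty_coefficient(gamma, eps)


if __name__ == "__main__":
    # Print both coefficients for discount factors approaching one.
    for gamma in (0.9, 0.99, 0.999, 0.9999):
        c = trpo_penalty_coefficient(gamma)
        c_norm = normalized_penalty_coefficient(gamma)
        print(f"gamma={gamma}: C={c:.3e}, (1-gamma)*C={c_norm:.3e}")
```

Running the script shows both quantities increasing as gamma approaches one, which is the sense in which the existing bound becomes uninformative in that regime.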