Easy Monotonic Policy Iteration
This paper addresses a key issue in reinforcement learning for control by providing a practical way to ensure monotonic policy improvement. The contribution is incremental, since it builds on prior work on policy improvement bounds.
The paper tackles policy performance degradation in reinforcement learning with general function approximators. It derives a new policy improvement bound that replaces the sup norm of the policy divergence with an average divergence. The resulting algorithm, Easy Monotonic Policy Iteration, guarantees non-decreasing returns and is easy to implement in sample-based frameworks.
A key problem in reinforcement learning for control with general function approximators (such as deep neural networks and other nonlinear functions) is that, for many algorithms employed in practice, updates to the policy or $Q$-function may fail to improve performance---or worse, actually cause the policy performance to degrade. Prior work has addressed this for policy iteration by deriving tight policy improvement bounds; by optimizing the lower bound on policy improvement, a better policy is guaranteed. However, existing approaches suffer from bounds that are hard to optimize in practice because they include sup norm terms which cannot be efficiently estimated or differentiated. In this work, we derive a better policy improvement bound where the sup norm of the policy divergence has been replaced with an average divergence; this leads to an algorithm, Easy Monotonic Policy Iteration, that generates sequences of policies with guaranteed non-decreasing returns and is easy to implement in a sample-based framework.
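To make the contrast between a sup norm term and an average divergence concrete, the following is a minimal sketch and not the paper's bound or algorithm. It assumes tabular policies over a small discrete state space and uses total variation distance as the divergence; the abstract does not say which divergence the bound uses, so that choice, the example policies, and the state visitation weights are all hypothetical. The sup norm requires the divergence at every state, whereas the average can be estimated from sampled states alone.

```python
import random


def tv_divergence(p, q):
    """Total variation distance between two discrete action distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def sup_divergence(pi_old, pi_new):
    """Worst-case divergence: the maximum over every state in the state space."""
    return max(tv_divergence(pi_old[s], pi_new[s]) for s in pi_old)


def average_divergence(pi_old, pi_new, sampled_states):
    """Monte Carlo estimate of the expected divergence over sampled states."""
    total = sum(tv_divergence(pi_old[s], pi_new[s]) for s in sampled_states)
    return total / len(sampled_states)


# Hypothetical 3-state, 2-action policies (state -> action probabilities).
pi_old = {0: [0.5, 0.5], 1: [0.5, 0.5], 2: [0.5, 0.5]}
pi_new = {0: [0.6, 0.4], 1: [0.5, 0.5], 2: [0.9, 0.1]}

# Hypothetical visitation weights: state 2, where the policies differ most,
# is rarely visited.
rng = random.Random(0)
sampled_states = rng.choices([0, 1, 2], weights=[0.6, 0.35, 0.05], k=1000)

sup_d = sup_divergence(pi_old, pi_new)
avg_d = average_divergence(pi_old, pi_new, sampled_states)

print(f"sup divergence:     {sup_d:.3f}")
print(f"average divergence: {avg_d:.3f}")
```

In this toy setting the sup norm is determined entirely by the rarely visited state, while the average weights each state by how often it appears in the samples, so it can never exceed the sup norm. In a large or continuous state space the maximum cannot be enumerated this way, which is the practical obstacle the abstract describes.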