LG AI MLJun 14, 2021

On-Policy Deep Reinforcement Learning for the Average-Reward Criterion

arXiv:2106.07329v118.959 citations

Originality Incremental advance

AI Analysis

This work addresses the challenge of optimizing long-term average rewards in reinforcement learning, which is crucial for applications requiring stable performance over time, though it is incremental as it adapts existing methods to a specific criterion.

The paper tackles the problem of deriving meaningful performance bounds for on-policy deep reinforcement learning under the average-reward criterion, showing that previous discounted-return bounds are inadequate and proposing a novel bound based on average divergence and Kemeny's constant, which leads to the development of ATRPO, an algorithm that significantly outperforms TRPO in challenging MuJuCo environments.

We develop theory and algorithms for average-reward on-policy Reinforcement Learning (RL). We first consider bounding the difference of the long-term average reward for two policies. We show that previous work based on the discounted return (Schulman et al., 2015; Achiam et al., 2017) results in a non-meaningful bound in the average-reward setting. By addressing the average-reward criterion directly, we then derive a novel bound which depends on the average divergence between the two policies and Kemeny's constant. Based on this bound, we develop an iterative procedure which produces a sequence of monotonically improved policies for the average reward criterion. This iterative procedure can then be combined with classic DRL (Deep Reinforcement Learning) methods, resulting in practical DRL algorithms that target the long-run average reward criterion. In particular, we demonstrate that Average-Reward TRPO (ATRPO), which adapts the on-policy TRPO algorithm to the average-reward criterion, significantly outperforms TRPO in the most challenging MuJuCo environments.

View on arXiv PDF

Similar