Relative Entropy Pathwise Policy Optimization
This work addresses training stability and efficiency issues faced by on-policy reinforcement learning practitioners. It is an incremental improvement that combines existing techniques in a novel way.
The paper tackles the high variance and instability in score-function based policy learning methods like REINFORCE and PPO by introducing REPPO, an on-policy algorithm that trains Q-value models from on-policy trajectories, enabling stable pathwise policy updates. The result shows strong empirical performance with superior sample efficiency, wall-clock time, memory footprint, and hyperparameter robustness on standard benchmarks.
Score-function based methods for policy learning, such as REINFORCE and PPO, have delivered strong results in game-playing and robotics, yet their high variance often undermines training stability. Using pathwise policy gradients, i.e. computing a derivative by differentiating the objective function, alleviates the variance issues. However, these gradients require an accurate action-conditioned value function, which is notoriously hard to learn without relying on replay buffers for reusing past off-policy data. We present an on-policy algorithm that trains Q-value models purely from on-policy trajectories, unlocking the possibility of using pathwise policy updates in the context of on-policy learning. We show how to combine stochastic policies for exploration with constrained updates for stable training, and evaluate important architectural components that stabilize value function learning. The result, Relative Entropy Pathwise Policy Optimization (REPPO), is an efficient on-policy algorithm that combines the stability of pathwise policy gradients with the simplicity and minimal memory footprint of standard on-policy learning. Compared to state-of-the-art methods on two standard GPU-parallelized benchmarks, REPPO provides strong empirical performance at superior sample efficiency, wall-clock time, memory footprint, and hyperparameter robustness.
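The variance gap between the two gradient estimators named above can be illustrated with a toy one-dimensional problem. The sketch below is not REPPO and is not taken from the paper. It assumes a hypothetical quadratic Q-function, `q_value`, with a known derivative, and a Gaussian policy with mean `mu` and standard deviation `sigma`. All function and variable names are illustrative. Both estimators target the same quantity, the derivative of the expected Q-value with respect to `mu`, which is 6 for the settings used here.

```python
import random
import statistics


def q_value(action):
    # Hypothetical stand-in for a learned Q-function (not from the paper).
    return -(action - 3.0) ** 2


def q_grad(action):
    # Analytic derivative of q_value with respect to the action.
    return -2.0 * (action - 3.0)


def gradient_samples(mu, sigma, n, seed=0):
    """Return per-sample estimates of d/dmu E[Q(a)] for a ~ N(mu, sigma^2)."""
    rng = random.Random(seed)
    score_fn, pathwise = [], []
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        action = mu + sigma * eps  # reparameterized sample from N(mu, sigma^2)
        # Score-function estimator: Q(a) * d/dmu log N(a; mu, sigma^2).
        score_fn.append(q_value(action) * (action - mu) / sigma ** 2)
        # Pathwise estimator: dQ/da * da/dmu, where da/dmu = 1.
        pathwise.append(q_grad(action))
    return score_fn, pathwise


score_fn, pathwise = gradient_samples(mu=0.0, sigma=1.0, n=20000)
print("score-function: mean", statistics.mean(score_fn),
      "variance", statistics.variance(score_fn))
print("pathwise:       mean", statistics.mean(pathwise),
      "variance", statistics.variance(pathwise))
```

In this toy setting both sample means should land near the true gradient, while the per-sample variance of the score-function estimate is far larger than that of the pathwise estimate (analytically 222 versus 4 for these settings). The pathwise estimator needs `q_grad`, the derivative of the Q-function with respect to the action, which is why the abstract stresses the need for an accurate action-conditioned value function.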