LG | May 6, 2022

Variance Reduction based Partial Trajectory Reuse to Accelerate Policy Gradient Optimization

arXiv:2205.02976v24.62 citations | h-index: 6 | Has Code

Originality: Incremental advance

AI Analysis

This paper addresses inefficient policy gradient optimization in low-data and online control scenarios, a problem faced by reinforcement learning practitioners. It is an incremental improvement over previous trajectory reuse methods.

The paper tackles the limitation of existing trajectory-based policy gradient methods that require complete episodes by proposing an approach that selectively reuses partial trajectories (per-step or per-decision observations) to improve policy gradient estimation. The result is accelerated learning and improved convergence, demonstrated empirically to enhance state-of-the-art policy optimization methods like actor-critic and proximal policy optimization.

Building on our previous study of green simulation assisted policy gradient (GS-PG), which focused on trajectory-based reuse, this paper considers infinite-horizon Markov Decision Processes and develops a new importance-sampling-based policy gradient optimization approach to support dynamic decision making. The existing GS-PG method was designed to learn from complete episodes or process trajectories, which limits its applicability to low-data situations and flexible online process control. To overcome this limitation, the proposed approach selectively reuses the most relevant partial trajectories, i.e., the reuse unit is a per-step or per-decision historical observation. Specifically, we create a mixture likelihood ratio (MLR) based policy gradient optimization that leverages information from historical state-action transitions generated under different behavioral policies. The proposed variance reduction experience replay (VRER) approach selects and reuses the most relevant transition observations, improves the policy gradient estimation, and accelerates the learning of the optimal policy. Our empirical study demonstrates that it improves optimization convergence and enhances the performance of state-of-the-art policy optimization approaches such as actor-critic methods and proximal policy optimization.
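To make the mixture likelihood ratio idea concrete, the following is a minimal sketch and not the paper's implementation. It assumes a tabular softmax policy, an equal-weight mixture over the behavioral policies, and precomputed advantage estimates, and it ignores any state-distribution correction. All function names (`softmax_policy`, `mlr_weight`, `mlr_policy_gradient`) are hypothetical and chosen for illustration only.

```python
import math

def softmax_policy(theta, state):
    """Action probabilities of a tabular softmax policy; theta[state] holds the logits."""
    logits = theta[state]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def mlr_weight(theta_target, behavior_thetas, state, action):
    """Target probability divided by the equal-weight mixture of behavioral probabilities.

    Assumption: every behavioral policy contributes equally to the mixture.
    """
    target = softmax_policy(theta_target, state)[action]
    mixture = sum(
        softmax_policy(th, state)[action] for th in behavior_thetas
    ) / len(behavior_thetas)
    return target / mixture

def grad_log_softmax(theta, state, action):
    """Gradient of log pi(action | state) with respect to the logits of `state`."""
    probs = softmax_policy(theta, state)
    return [(1.0 if a == action else 0.0) - p for a, p in enumerate(probs)]

def mlr_policy_gradient(theta_target, behavior_thetas, transitions):
    """Average of weight * advantage * grad-log-pi over the reused transitions.

    transitions: list of (state, action, advantage) tuples (hypothetical format).
    Returns a dict mapping each state to the gradient for its logits.
    """
    grad = {s: [0.0] * len(logits) for s, logits in theta_target.items()}
    n = len(transitions)
    for state, action, advantage in transitions:
        w = mlr_weight(theta_target, behavior_thetas, state, action)
        g = grad_log_softmax(theta_target, state, action)
        for a, g_a in enumerate(g):
            grad[state][a] += w * advantage * g_a / n
    return grad

# Usage: one state, two actions, transitions gathered under two older policies.
theta_now = {0: [0.0, math.log(3.0)]}   # pi(a=1 | s=0) = 0.75
theta_old = {0: [0.0, 0.0]}             # uniform behavioral policy
history = [(0, 1, 2.0), (0, 0, -1.0)]   # (state, action, advantage)
estimate = mlr_policy_gradient(theta_now, [theta_old, theta_now], history)
```

Dividing by the mixture rather than by a single behavioral policy keeps the weight bounded whenever at least one behavioral policy assigns the action a probability close to the target's. A selection step in the spirit of VRER would additionally decide which historical transitions to include in `history`; that rule is not sketched here.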
