LG · Nov 14, 2023

On-Policy Policy Gradient Reinforcement Learning Without On-Policy Sampling

arXiv:2311.08290v3 · 6 citations · h-index: 22
Originality: Incremental advance
AI Analysis

This paper addresses data inefficiency in on-policy RL, a problem relevant to both researchers and practitioners. It is an incremental improvement that adapts existing off-policy sampling ideas to policy gradient methods.

The paper tackles high-variance gradient estimates in on-policy reinforcement learning, which arise from sampling error. It introduces PROPS, an adaptive off-policy sampling method that reduces sampling error and increases data efficiency, and reports improved performance on MuJoCo and discrete-action tasks.

On-policy reinforcement learning (RL) algorithms are typically characterized as algorithms that perform policy updates using i.i.d. trajectories collected by the agent's current policy. However, after observing only a finite number of trajectories, such on-policy sampling may produce data that fails to match the expected on-policy data distribution. This sampling error leads to high-variance gradient estimates that yield data-inefficient on-policy learning. Recent work in the policy evaluation setting has shown that non-i.i.d., off-policy sampling can produce data with lower sampling error w.r.t. the expected on-policy distribution than on-policy sampling can produce (Zhong et al., 2022). Motivated by this observation, we introduce an adaptive, off-policy sampling method to reduce sampling error during on-policy policy gradient RL training. Our method, Proximal Robust On-Policy Sampling (PROPS), reduces sampling error by collecting data with a behavior policy that increases the probability of sampling actions that are under-sampled w.r.t. the current policy. We empirically evaluate PROPS on continuous-action MuJoCo benchmark tasks as well as discrete-action tasks and demonstrate that PROPS (1) decreases sampling error throughout training and (2) increases the data efficiency of on-policy policy gradient algorithms.
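The sampling-error argument in the abstract can be made concrete with a toy sketch. The code below is not PROPS: it assumes a single state with four discrete actions, and it replaces the adaptive behavior policy with a simple greedy rule that always takes the currently most under-sampled action. The function names, the example policy `pi`, and the choice of total variation distance as the error measure are all illustrative assumptions, not taken from the paper.

```python
import numpy as np


def sampling_error(counts, pi):
    """Total variation distance between empirical action frequencies and pi."""
    empirical = counts / counts.sum()
    return 0.5 * np.abs(empirical - pi).sum()


def sample_on_policy(pi, n, rng):
    """Draw n i.i.d. actions from pi and return per-action counts."""
    actions = rng.choice(len(pi), size=n, p=pi)
    return np.bincount(actions, minlength=len(pi)).astype(float)


def sample_adaptive(pi, n):
    """Toy non-i.i.d. sampler (an assumption, not the paper's method):
    at each step, take the action whose observed count lags furthest
    behind its expected count under pi."""
    counts = np.zeros(len(pi))
    for t in range(1, n + 1):
        # Expected count after t samples minus the count observed so far.
        deficit = t * pi - counts
        counts[np.argmax(deficit)] += 1
    return counts


# Hypothetical target policy over four actions in a single state.
pi = np.array([0.5, 0.3, 0.15, 0.05])
rng = np.random.default_rng(0)
n = 64

# Average the i.i.d. error over repeated batches; the adaptive sampler is deterministic.
iid_errors = [sampling_error(sample_on_policy(pi, n, rng), pi) for _ in range(200)]
adaptive_error = sampling_error(sample_adaptive(pi, n), pi)

print(f"mean i.i.d. sampling error: {np.mean(iid_errors):.4f}")
print(f"adaptive sampling error:    {adaptive_error:.4f}")
```

With the same budget of 64 samples, the greedy sampler keeps every action count close to its expected value, so its empirical distribution tracks `pi` more tightly than a typical i.i.d. batch does. This is the intuition the abstract appeals to; a full method would also need to handle many states, continuous actions, and a policy that changes during training.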

