Non-Uniform Noise-to-Signal Ratio in the REINFORCE Policy-Gradient Estimator
This work addresses training instability in reinforcement learning, a problem relevant to practitioners, but the contribution is incremental because it builds on existing NSR analysis.
The paper addresses instability and slowdown in policy-gradient reinforcement learning by analyzing the noise-to-signal ratio (NSR) of the REINFORCE estimator. It finds that the NSR is non-uniform across policy parameters and often increases near optima, which can lead to training instability and policy collapse across a range of examples.
Policy-gradient methods are widely used in reinforcement learning, yet training often becomes unstable or slows down as learning progresses. We study this phenomenon through the noise-to-signal ratio (NSR) of a policy-gradient estimator, defined as the estimator variance (noise) normalized by the squared norm of the true gradient (signal). Our main result is that, for (i) finite-horizon linear systems with Gaussian policies and linear state-feedback, and (ii) finite-horizon polynomial systems with Gaussian policies and polynomial feedback, the NSR of the REINFORCE estimator can be characterized exactly, without approximation, either in closed form or via numerical moment-evaluation algorithms. For general nonlinear dynamics and expressive policies (including neural policies), we further derive a general upper bound on the variance. These characterizations enable a direct examination of how NSR varies across policy parameters and how it evolves along optimization trajectories (e.g., SGD and Adam). Across a range of examples, we find that the NSR landscape is highly non-uniform and typically increases as the policy approaches an optimum; in some regimes it blows up, which can trigger training instability and policy collapse.
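To make the NSR definition concrete, the following is a minimal Monte Carlo sketch, not the exact characterization derived in the paper. It assumes a scalar linear system x_{t+1} = a x_t + b u_t with a Gaussian policy u_t ~ N(k x_t, sigma^2) and a quadratic cost. The function names (`reinforce_grad_samples`, `estimate_nsr`), the cost weights, and all default values are hypothetical choices made for illustration, and the true gradient is replaced by the sample mean of the estimator.

```python
import numpy as np

def reinforce_grad_samples(k, n_traj=20000, horizon=5, a=1.0, b=1.0,
                           q=1.0, r=0.1, sigma=0.5, x0=1.0, seed=0):
    """Per-trajectory REINFORCE gradient samples w.r.t. the feedback gain k.

    Assumed setup (illustrative only): scalar linear dynamics, Gaussian
    policy with fixed standard deviation sigma, quadratic stage cost.
    """
    rng = np.random.default_rng(seed)
    x = np.full(n_traj, x0, dtype=float)
    cost = np.zeros(n_traj)
    score = np.zeros(n_traj)
    for _ in range(horizon):
        eps = rng.standard_normal(n_traj)
        u = k * x + sigma * eps      # sample u ~ N(k x, sigma^2)
        score += eps * x / sigma     # d/dk log pi(u|x) = (u - k x) x / sigma^2
        cost += q * x**2 + r * u**2  # accumulate quadratic stage cost
        x = a * x + b * u            # linear dynamics
    cost += q * x**2                 # terminal state cost
    # REINFORCE: (sum of score functions) * (total trajectory cost)
    return score * cost

def estimate_nsr(k, **kwargs):
    """NSR = estimator variance / squared gradient.

    Assumption: the true gradient is approximated by the sample mean,
    so the result is itself a noisy estimate.
    """
    g = reinforce_grad_samples(k, **kwargs)
    noise = g.var(ddof=1)
    signal = g.mean() ** 2
    return noise / signal

if __name__ == "__main__":
    for k in (-0.2, -0.5, -0.8):
        print(f"k = {k:+.1f}  estimated NSR = {estimate_nsr(k):.1f}")
```

Evaluating `estimate_nsr` over a grid of gains gives a rough picture of how the ratio varies with the policy parameter. Near a stationary point the sample-mean gradient is itself dominated by noise, so this estimate becomes unreliable exactly where the NSR is largest, which is one reason an exact characterization is useful.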