LG AI | Jan 16, 2023

The Role of Baselines in Policy Gradient Optimization

Jincheng Mei, Wesley Chung, Valentin Thomas, Bo Dai, Csaba Szepesvari, Dale Schuurmans

DeepMind, MILA

arXiv:2301.06276v121.131 citations | h-index: 77

Originality: Highly original

AI Analysis

The paper offers reinforcement learning practitioners a new theoretical understanding of baselines, addressing a gap between theory and practice in policy optimization methods.

The paper tackles the role of baselines in policy gradient optimization by showing that the state value baseline enables on-policy stochastic natural policy gradient to converge to a globally optimal policy at an O(1/t) rate, and reveals that its primary effect is to reduce update aggressiveness rather than variance.

We study the effect of baselines in on-policy stochastic policy gradient optimization, and close the gap between the theory and practice of policy optimization methods. Our first contribution is to show that the state value baseline allows on-policy stochastic natural policy gradient (NPG) to converge to a globally optimal policy at an O(1/t) rate, which was not previously known. The analysis relies on two novel findings: the expected progress of the NPG update satisfies a stochastic version of the non-uniform Łojasiewicz (NŁ) inequality, and with probability 1 the state value baseline prevents the optimal action's probability from vanishing, thus ensuring sufficient exploration. Importantly, these results provide a new understanding of the role of baselines in stochastic policy gradient: by showing that the variance of natural policy gradient estimates remains unbounded with or without a baseline, we find that variance reduction cannot explain their utility in this setting. Instead, the analysis reveals that the primary effect of the value baseline is to reduce the aggressiveness of the updates rather than their variance. That is, we demonstrate that a finite variance is not necessary for almost sure convergence of stochastic NPG, while controlling update aggressiveness is both necessary and sufficient. Additional experimental results verify these theoretical findings.
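The abstract does not spell out the update rule, so the following is only a minimal sketch under stated assumptions: a one-state bandit, a tabular softmax policy, and an importance-sampled reward estimate in which only the sampled action's logit moves. In a bandit the state value reduces to the expected reward under the current policy, which is used here as the baseline. The names `softmax`, `npg_step`, and `run` are hypothetical and chosen for illustration.

```python
import numpy as np


def softmax(theta):
    """Numerically stable softmax over action logits."""
    z = theta - np.max(theta)
    e = np.exp(z)
    return e / e.sum()


def npg_step(theta, rewards, eta, rng, use_baseline=True):
    """One on-policy stochastic NPG-style update on a one-state bandit.

    Assumption: the reward estimate is importance-sampled, so it is
    nonzero only at the sampled action a and is scaled by 1 / pi[a].
    """
    pi = softmax(theta)
    a = rng.choice(len(theta), p=pi)
    # Value baseline: expected reward under the current policy.
    baseline = float(pi @ rewards) if use_baseline else 0.0
    r_hat = np.zeros_like(theta)
    r_hat[a] = (rewards[a] - baseline) / pi[a]
    return theta + eta * r_hat, a


def run(rewards, steps=2000, eta=0.1, seed=0, use_baseline=True):
    """Run the update repeatedly and return the final policy."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(len(rewards))
    for _ in range(steps):
        theta, _ = npg_step(theta, rewards, eta, rng, use_baseline)
    return softmax(theta)
```

In this sketch, with positive rewards and no baseline, every update raises the sampled action's logit regardless of how good that action is, and the 1 / pi[a] factor can make the step very large. With the baseline, the sign of the step depends on whether the sampled action beats the current expected reward, which is one way to picture the reduced update aggressiveness that the abstract describes.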
