Delightful Policy Gradient
This paper addresses two inefficiencies in policy gradient methods for reinforcement learning and proposes a gated update intended to improve training stability and performance. The contribution appears incremental, since it builds on existing policy gradient techniques.
The paper tackles two problems with standard policy gradients: rare negative-advantage actions distort the update direction, and updates are over-allocated to contexts the policy already handles well. It introduces the Delightful Policy Gradient (DG), which gates each update term with a sigmoid of the product of advantage and action surprisal, and reports improved performance over baselines such as REINFORCE and PPO on tasks such as MNIST and continuous control.
Standard policy gradients weight each sampled action by advantage alone, regardless of how likely that action was under the current policy. This creates two pathologies: within a single decision context (e.g. one image or prompt), a rare negative-advantage action can disproportionately distort the update direction; across many such contexts in a batch, the expected gradient over-allocates budget to contexts the policy already handles well. We introduce the \textit{Delightful Policy Gradient} (DG), which gates each term with a sigmoid of \emph{delight}, the product of advantage and action surprisal (negative log-probability). For $K$-armed bandits, DG provably improves directional accuracy in a single context and, across multiple contexts, shifts the expected gradient strictly closer to the supervised cross-entropy oracle. This second effect is not variance reduction: it persists even with infinite samples. Empirically, DG outperforms REINFORCE, PPO, and advantage-weighted baselines across MNIST, transformer sequence modeling, and continuous control, with larger gains on harder tasks.
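To make the gating rule concrete, the following is a minimal NumPy sketch for a single-context softmax bandit. It follows only what the abstract states: delight is advantage times surprisal, and each gradient term is multiplied by a sigmoid of delight. The function names (`dg_gate`, `policy_gradient`), the absence of any temperature on the sigmoid, and the choice to treat the gate as a constant weight rather than differentiating through it are assumptions made for illustration, not details confirmed by the paper.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()


def sigmoid(x):
    """Logistic function written via tanh to avoid overflow in exp."""
    return 0.5 * (1.0 + np.tanh(0.5 * x))


def dg_gate(probs, actions, advantages):
    """Gate for each sampled action: sigmoid(advantage * surprisal).

    Assumption: no temperature or scaling inside the sigmoid.
    """
    surprisal = -np.log(probs[actions])   # negative log-probability
    delight = advantages * surprisal      # "delight" as defined in the abstract
    return sigmoid(delight)


def policy_gradient(logits, actions, advantages, gated=True):
    """Batch-averaged gradient estimate with respect to the logits.

    With gated=False this is the plain advantage-weighted score-function
    estimate. With gated=True each term is additionally multiplied by the
    gate, which is treated as a constant weight (an assumption).
    """
    probs = softmax(logits)
    # Gradient of log softmax(logits)[a] w.r.t. logits is onehot(a) - probs.
    score = np.eye(len(logits))[actions] - probs
    weights = advantages.astype(float)
    if gated:
        weights = weights * dg_gate(probs, actions, advantages)
    return np.mean(weights[:, None] * score, axis=0)


if __name__ == "__main__":
    logits = np.log(np.array([0.9, 0.1]))   # policy with one rare action
    actions = np.array([1])                  # the rare action was sampled
    advantages = np.array([-1.0])            # and it had negative advantage
    print("plain :", policy_gradient(logits, actions, advantages, gated=False))
    print("gated :", policy_gradient(logits, actions, advantages, gated=True))
```

Under these assumptions, a policy with probabilities (0.9, 0.1) that samples the rare action with advantage -1 gets a gate of 1/11, so that term is shrunk to about 9% of its ungated size. The same rare action with advantage +1 gets a gate of 10/11 and passes almost unchanged.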