LG May 12

Delightful Gradients Accelerate Corner Escape

arXiv:2605.11908

AI Analysis

For reinforcement learning practitioners, DG offers a simple fix for corner trapping, which can make the transient behavior of softmax policy gradient exponentially slow near sub-optimal corners.

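Delightful Policy Gradient (DG) gates each policy-gradient term by the product of advantage and action surprisal. In $K$-armed bandits, the zero-temperature limit of DG removes the corner-trapping mechanism near any sub-optimal corner, and DG converges globally at an asymptotic $O(1/t)$ rate in both bandits and tabular MDPs. An exact counterexample shows the tabular mechanism can fail under shared function approximation, yet in MNIST contextual bandits with a shared-parameter neural network DG still recovers from bad initializations faster than standard policy gradient.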

Softmax policy gradient converges at $O(1/t)$, but its transient behavior near sub-optimal corners of the simplex can be exponentially slow. The bottleneck is self-trapping: negative-advantage actions reinforce the corner policy and can initially push the optimal action backward. We study \emph{Delightful Policy Gradient} (DG), which gates each policy-gradient term by the product of advantage and action surprisal. For $K$-armed bandits, we prove that the zero-temperature limit of DG removes this corner-trapping mechanism on a quantitative sector near any sub-optimal corner, yielding a first-exit escape bound logarithmic in the initial probability ratio. At every fixed temperature, the same local mechanism persists because harmful actions are polynomially suppressed as they become rare. A key structural insight is that every action better than the corner action is an \emph{ally}: its contribution to escape is non-negative. Combining corner instability with a monotonic value improvement identity, we prove that DG converges globally to the optimal policy in both bandits and tabular MDPs at an asymptotic $O(1/t)$ rate. We also show, via an exact counterexample, that this tabular mechanism can fail under shared function approximation. In MNIST contextual bandits with a shared-parameter neural network, DG nevertheless recovers from bad initializations faster than standard policy gradient, suggesting that the counterexample marks a boundary of the theory rather than a practical prohibition.
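The corner-trapping mechanism described above can be illustrated with a minimal sketch of the exact expected gradient in a 3-armed bandit. The abstract does not give the functional form of the gate, so the sigmoid gate `sigmoid(advantage * surprisal / eta)`, the temperature name `eta`, the reward vector, and the starting probabilities are all assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def expected_gradient(logits, rewards, eta=None):
    """Exact expected gradient w.r.t. the logits of a softmax bandit policy.

    eta=None -> standard softmax policy gradient (every gate equals 1).
    eta > 0  -> each term is gated by sigmoid(advantage * surprisal / eta).
                This gate form is an ASSUMPTION; the source only says the
                gate depends on the product of advantage and surprisal.
    """
    pi = softmax(logits)
    value = sum(p * r for p, r in zip(pi, rewards))
    adv = [r - value for r in rewards]
    if eta is None:
        gates = [1.0] * len(pi)
    else:
        # Surprisal of action a is -log pi(a).
        gates = [sigmoid(a * (-math.log(p)) / eta) for a, p in zip(adv, pi)]
    # Expected gated term: sum_a pi_a * gate_a * adv_a * (e_a - pi).
    c = [p * g * a for p, g, a in zip(pi, gates, adv)]
    total = sum(c)
    grad = [ci - p * total for ci, p in zip(c, pi)]
    return grad, gates


# Hypothetical setup: action 0 is optimal, action 1 is the sub-optimal
# corner, action 2 is a bad action that is more likely than action 0.
rewards = [1.0, 0.5, 0.0]
logits = [math.log(p) for p in (0.001, 0.989, 0.010)]

pg_grad, _ = expected_gradient(logits, rewards)
dg_grad, dg_gates = expected_gradient(logits, rewards, eta=1.0)

# Drift of the log-probability ratio log(pi_0 / pi_1) under one gradient step.
pg_drift = pg_grad[0] - pg_grad[1]
dg_drift = dg_grad[0] - dg_grad[1]
print("PG drift:", pg_drift)
print("DG drift:", dg_drift)
print("DG gates:", dg_gates)
```

At this starting point the ungated drift of `log(pi_0 / pi_1)` is negative, because the negative-advantage action 2 lowers the value below the corner action's reward and so reinforces the corner. Under the assumed gate, the rare harmful action is strongly down-weighted and the drift becomes positive. This is a single-step illustration under the stated assumptions, not a reproduction of the paper's escape bound.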
