LG · CL · May 24, 2025

On the Effect of Negative Gradient in Group Relative Deep Reinforcement Optimization

arXiv:2505.18830v1 · 26 citations · h-index: 10
Originality: Incremental advance
AI Analysis

This paper addresses a specific misalignment issue in RL for LLMs and offers an incremental improvement to their reasoning capabilities.

The paper tackles Lazy Likelihood Displacement (LLD) in Group Relative Policy Optimization (GRPO) for reinforcement learning with large language models: during training, the likelihood of correct responses stagnates or even drops. It introduces NTHR to mitigate this and reports consistent performance gains on math reasoning benchmarks across models from 0.5B to 3B parameters.

Reinforcement learning (RL) has become popular in enhancing the reasoning capabilities of large language models (LLMs), with Group Relative Policy Optimization (GRPO) emerging as a widely used algorithm in recent systems. Despite GRPO's widespread adoption, we identify a previously unrecognized phenomenon we term Lazy Likelihood Displacement (LLD), wherein the likelihood of correct responses marginally increases or even decreases during training. This behavior mirrors a recently discovered misalignment issue in Direct Preference Optimization (DPO), attributed to the influence of negative gradients. We provide a theoretical analysis of GRPO's learning dynamic, identifying the source of LLD as the naive penalization of all tokens in incorrect responses with the same strength. To address this, we develop a method called NTHR, which downweights penalties on tokens contributing to the LLD. Unlike prior DPO-based approaches, NTHR takes advantage of GRPO's group-based structure, using correct responses as anchors to identify influential tokens. Experiments on math reasoning benchmarks demonstrate that NTHR effectively mitigates LLD, yielding consistent performance gains across models ranging from 0.5B to 3B parameters.
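The mechanism described in the abstract can be illustrated with a minimal sketch. It assumes the commonly used GRPO advantage (each response's reward standardized within its group, here with the population standard deviation) and a hypothetical per-token influence score with a hypothetical threshold and scale. The function names, the scores, and the constants are illustrative assumptions, not the paper's NTHR formulation, which derives token influence from the correct responses in the group.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def downweight_negative_tokens(advantage, influence, threshold, scale=0.5):
    """Return one advantage value per token of a single response.

    Plain GRPO would apply `advantage` uniformly to every token. Here,
    for an incorrect response (negative advantage), tokens whose
    hypothetical influence score exceeds `threshold` get a weaker penalty.
    """
    if advantage >= 0:
        # Correct responses are left untouched in this sketch.
        return [advantage] * len(influence)
    return [advantage * scale if s > threshold else advantage
            for s in influence]


# Four sampled responses to one prompt: two correct, two incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
advs = group_relative_advantages(rewards)

# Hypothetical influence scores for the three tokens of response 1.
token_advs = downweight_negative_tokens(
    advs[1], influence=[0.1, 0.9, 0.3], threshold=0.5
)
print(advs)
print(token_advs)
```

In this toy group the second token of the incorrect response receives half the penalty of the other two, which is the kind of selective downweighting the abstract contrasts with penalizing all tokens of an incorrect response at the same strength.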

