LG · AI · CL · May 20

Value-Gradient Hypothesis of RL for LLMs

arXiv:2605.21654
AI Analysis

For researchers and practitioners using RL for LLM fine-tuning, this work offers theoretical insight into the effectiveness of critic-free methods and a practical criterion for predicting RL gains.

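The paper develops a value-gradient perspective to explain why critic-free RL methods like PPO and GRPO work for LLM post-training. Under a differentiable rollout with an additive-noise parameterization, it shows that the actor update matches the value gradient in expectation; for discrete transformer policies, backpropagation through attention approximates that signal, with an error controlled by the sampling gap and policy entropy. It also gives a criterion for when RL should be most effective, based on value gradient signal and reachable reward headroom.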

Reinforcement learning substantially improves pretrained language models, but it remains understudied why critic-free methods such as PPO and GRPO work as well as they do, and when they should provide the largest gains. We develop a value-gradient perspective of critic-free RL for LLM post-training. First, under a differentiable rollout and additive-noise parameterization, we show that the actor update is value-gradient-like in expectation: the backward pass propagates costates whose conditional expectation equals the value gradient. Second, for discrete transformer policies, we show that autodifferentiation through attention produces empirical costates that approximate this value signal, with an error controlled by the sampling gap and policy entropy. These results motivate a decomposition of RL impact into value gradient signal and reachable reward headroom, yielding a criterion for when RL should be most effective along a pretraining trajectory.
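
The first claim in the abstract (backpropagated costates whose expectation equals the value gradient) can be checked numerically on a toy problem. The sketch below is not the paper's construction: it assumes a hypothetical scalar linear system x_{t+1} = a*x_t + b*u_t, a linear policy with additive Gaussian noise u_t = theta*x_t + sigma*eps_t, and a per-step reward r(x) = -x^2 collected at steps 1..T. All function and parameter names are illustrative. With the noise held fixed, the backward pass is the costate recursion lambda_t = dr/dx_t + (a + b*theta) * lambda_{t+1}, and its Monte Carlo average at t = 0 should approach the closed-form gradient of the value function with respect to x_0.

```python
import numpy as np


def mean_costate(x0, a, b, theta, sigma, horizon, n_rollouts, seed=0):
    """Average the backpropagated costate at t=0 over sampled rollouts.

    Toy assumption: x_{t+1} = a*x_t + b*(theta*x_t + sigma*eps_t),
    reward r(x_t) = -x_t**2 for t = 1..horizon.
    """
    rng = np.random.default_rng(seed)
    c = a + b * theta  # closed-loop Jacobian d x_{t+1} / d x_t with noise held fixed
    eps = rng.standard_normal((n_rollouts, horizon))

    # Forward pass: roll out every trajectory with its own noise sequence.
    xs = np.empty((n_rollouts, horizon + 1))
    xs[:, 0] = x0
    for t in range(horizon):
        xs[:, t + 1] = c * xs[:, t] + b * sigma * eps[:, t]

    # Backward pass: lambda_T = dr/dx_T, then lambda_t = dr/dx_t + c * lambda_{t+1}.
    lam = -2.0 * xs[:, horizon]
    for t in range(horizon - 1, 0, -1):
        lam = -2.0 * xs[:, t] + c * lam
    lam0 = c * lam  # no reward is collected at t = 0

    return float(lam0.mean())


def value_gradient(x0, a, b, theta, horizon):
    """Closed-form dV/dx0 for the same toy problem.

    E[x_t**2] = c**(2*t) * x0**2 + (a noise term independent of x0),
    so dV/dx0 = -2 * x0 * sum_{t=1..T} c**(2*t).
    """
    c = a + b * theta
    return -2.0 * x0 * sum(c ** (2 * t) for t in range(1, horizon + 1))


if __name__ == "__main__":
    params = dict(x0=1.0, a=1.0, b=1.0, theta=-0.2)
    estimate = mean_costate(sigma=0.1, horizon=10, n_rollouts=20000, **params)
    exact = value_gradient(horizon=10, **params)
    print(f"mean costate: {estimate:.4f}   analytic value gradient: {exact:.4f}")
```

In this linear-quadratic toy the match is exact in expectation, so the only gap is Monte Carlo error. The discrete transformer case described in the abstract has an additional approximation error that this sketch does not model.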
