LG · Jun 2

Self-Distilled Policy Gradient

Yifeng Liu, Shiyuan Zhang, Yifan Zhang, Quanquan Gu

arXiv:2606.04036 · 89.0 · Has Code

AI Analysis

For researchers working on reinforcement learning with language models, SDPG offers a more stable and performant training method by integrating self-distillation as an auxiliary loss.

SDPG combines group-relative verifier advantages, exact full-vocabulary on-policy self-distillation, and reference-policy KL regularization to improve stability and performance over RLVR and self-distillation baselines in sparse-reward reinforcement learning for language models.

On-policy self-distillation, where a language model conditions on privileged context to supervise its own generations, is a promising source of dense supervision for sparse-reward reinforcement learning. It can be instantiated as an auxiliary full-vocabulary student-to-teacher reverse Kullback-Leibler divergence loss. We therefore propose SDPG, a self-distilled policy-gradient framework that combines group-relative verifier advantages with standard-deviation normalization, exact full-vocabulary on-policy self-distillation, and reference-policy KL regularization. Empirically, SDPG improves stability and performance over RLVR and self-distillation baselines. The code is available at https://github.com/lauyikfung/SDPG.
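To make two of the ingredients named in the abstract concrete, here is a minimal sketch in plain Python. It is not taken from the SDPG repository: the function names, the `eps` constant, and the toy numbers are illustrative assumptions. It shows group-relative advantages normalized by the group standard deviation, and a full-vocabulary reverse KL, KL(student || teacher), at a single token position. How SDPG weights and combines these terms with the reference-policy KL is not stated in the abstract and is not modeled here.

```python
import math

def group_relative_advantages(rewards, eps=1e-6):
    """Center verifier rewards on the group mean and divide by the group std.

    `eps` is an assumed small constant that avoids division by zero when
    every sample in the group receives the same reward.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher), summed over every vocabulary entry.

    "Full-vocabulary" here means the sum runs over all tokens rather than
    only the sampled one. In self-distillation the teacher logits would come
    from the same model conditioned on privileged context (assumption: they
    are treated as fixed targets).
    """
    log_p = log_softmax(student_logits)
    log_q = log_softmax(teacher_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

# Toy group of four sampled responses scored by a binary verifier.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Toy logits over a three-token vocabulary at one position.
student = [2.0, 0.5, -1.0]
teacher = [1.5, 1.0, -0.5]
kl = reverse_kl(student, teacher)
```

In this toy setup, correct responses receive positive advantages and incorrect ones negative, while the reverse KL is zero only when the student and teacher distributions match.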
