SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models
This work addresses reinforcement-learning fine-tuning of diffusion language models, a problem relevant to AI researchers and practitioners. The contribution is incremental in that it builds on existing policy gradient methods.
The paper tackles the challenge of aligning diffusion large language models with human preferences via reinforcement learning. It proposes the Sandwiched Policy Gradient (SPG), which uses both an upper and a lower bound of the log-likelihood to reduce policy gradient bias. Reported accuracy gains over prior RL methods range from 2.6% on MATH500 to 27.0% on Sudoku.
Diffusion large language models (dLLMs) are emerging as an efficient alternative to autoregressive models due to their ability to decode multiple tokens in parallel. However, aligning dLLMs with human preferences or task-specific rewards via reinforcement learning (RL) is challenging because their intractable log-likelihood precludes the direct application of standard policy gradient methods. While prior work uses surrogates like the evidence lower bound (ELBO), these one-sided approximations can introduce significant policy gradient bias. To address this, we propose the Sandwiched Policy Gradient (SPG), which leverages both an upper and a lower bound of the true log-likelihood. Experiments show that SPG significantly outperforms baselines based on ELBO or one-step estimation. Specifically, SPG improves the accuracy over state-of-the-art RL methods for dLLMs by 3.6% on GSM8K, 2.6% on MATH500, 18.4% on Countdown, and 27.0% on Sudoku.
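
To illustrate why a one-sided bound can bias the objective, the following is a minimal sketch, not the authors' implementation. It assumes a policy-gradient objective of the form mean(A_i * log p(x_i)) and assumes, as one plausible reading of "sandwiched", that the lower bound is used for samples with non-negative advantage and the upper bound for samples with negative advantage. The function name `sandwiched_surrogate` and all numbers are hypothetical.

```python
from typing import Sequence


def sandwiched_surrogate(
    advantages: Sequence[float],
    lower_bounds: Sequence[float],
    upper_bounds: Sequence[float],
) -> float:
    """Hypothetical surrogate for mean(A_i * log p(x_i)).

    Assumes lower_bounds[i] <= log p(x_i) <= upper_bounds[i].
    For A_i >= 0 the lower bound is used; for A_i < 0 the upper bound
    is used. Under that assumption each term is <= A_i * log p(x_i),
    so the result never exceeds the true objective.
    """
    if not (len(advantages) == len(lower_bounds) == len(upper_bounds)):
        raise ValueError("all inputs must have the same length")
    if len(advantages) == 0:
        raise ValueError("inputs must be non-empty")

    total = 0.0
    for adv, lo, hi in zip(advantages, lower_bounds, upper_bounds):
        if lo > hi:
            raise ValueError("lower bound exceeds upper bound")
        bound = lo if adv >= 0 else hi  # pick the side that keeps the term a lower bound
        total += adv * bound
    return total / len(advantages)


# Made-up numbers: one sample with positive advantage, one with negative.
advantages = [1.0, -1.0]
true_logp = [-2.5, -2.0]   # unknown in practice; used here only for comparison
lower = [-3.0, -4.0]       # e.g. ELBO-style lower bounds
upper = [-2.0, -1.0]       # assumed upper bounds

true_objective = sum(a * l for a, l in zip(advantages, true_logp)) / 2
sandwiched = sandwiched_surrogate(advantages, lower, upper)
lower_only = sandwiched_surrogate(advantages, lower, lower)  # lower bound on both sides
```

With these made-up values the sandwiched surrogate stays below the true objective, while plugging the lower bound into the negative-advantage term overshoots it. That overshoot is the kind of one-sided error the abstract attributes to ELBO-only surrogates.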