Diffusion-State Policy Optimization for Masked Diffusion Language Models
This work addresses a specific bottleneck in training diffusion language models for tasks that require precise intermediate decisions; it is an incremental advance.
The paper tackles coarse credit assignment in masked diffusion language models by proposing DiSPO, a method that directly optimizes intermediate filling decisions. With LLaDA-8B-Instruct, DiSPO consistently improves over the terminal-feedback baseline on math and planning benchmarks.
Masked diffusion language models generate by iteratively filling masked tokens over multiple denoising steps, so learning only from a terminal reward on the final completion yields coarse credit assignment over intermediate decisions. We propose DiSPO (Diffusion-State Policy Optimization), a plug-in credit-assignment layer that directly optimizes intermediate filling decisions. At selected intermediate masked states, DiSPO branches by resampling fillings for the currently masked positions from rollout-cached logits, scores the resulting completions, and updates only the newly filled tokens -- without additional multi-step diffusion rollouts. We formalize a fixed-state objective for branched completions and derive a policy-gradient estimator that can be combined with terminal-feedback policy optimization using the same rollouts. On LLaDA-8B-Instruct, DiSPO consistently improves over the terminal-feedback diffu-GRPO baseline on math and planning benchmarks under matched rollout compute and optimizer steps. Our code will be available at https://daioba.github.io/dispo .
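The branching step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the `MASK` sentinel, the function names, the toy vocabulary, and the group-normalized (GRPO-style) advantage are all assumptions made for illustration, and a real setup would backpropagate through the model rather than compute a scalar. The sketch shows only the three ideas the abstract names: resampling fillings for the masked positions from cached logits, scoring each branch, and restricting the log-probability term to the newly filled tokens.

```python
import math
import random

MASK = -1  # hypothetical sentinel for a masked position (assumption)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def branch_from_state(state, cached_logits, num_branches, rng):
    """Resample fillings for the masked positions of one intermediate state.

    `cached_logits[i]` holds the logits recorded for position i during the
    original rollout, so no further denoising steps are run here.
    Returns the masked positions and a list of (filled_sequence, log_prob),
    where log_prob sums over the newly filled tokens only.
    """
    masked = [i for i, tok in enumerate(state) if tok == MASK]
    branches = []
    for _ in range(num_branches):
        filled = list(state)
        log_prob = 0.0
        for i in masked:
            probs = softmax(cached_logits[i])
            tok = rng.choices(range(len(probs)), weights=probs)[0]
            filled[i] = tok
            log_prob += math.log(probs[tok])
        branches.append((filled, log_prob))
    return masked, branches


def group_advantages(rewards, eps=1e-8):
    """Center and scale rewards within the branch group (assumed GRPO-style)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def surrogate_loss(branches, rewards):
    """Policy-gradient surrogate: negative mean of advantage * log-prob."""
    advantages = group_advantages(rewards)
    total = sum(a * lp for a, (_, lp) in zip(advantages, branches))
    return -total / len(branches)


if __name__ == "__main__":
    rng = random.Random(0)
    state = [5, MASK, 2, MASK]  # positions 1 and 3 are still masked
    logits = {1: [2.0, 0.5, -1.0], 3: [0.0, 0.0, 3.0]}  # toy 3-token vocabulary
    _, branches = branch_from_state(state, logits, num_branches=4, rng=rng)
    # Hypothetical scorer: reward 1.0 when position 3 is filled with token 2.
    rewards = [1.0 if seq[3] == 2 else 0.0 for seq, _ in branches]
    print(surrogate_loss(branches, rewards))
```

Because every branch shares the same intermediate state and differs only in the resampled tokens, the reward differences within the group are attributable to those fillings, which is the finer credit assignment the abstract contrasts with terminal-only feedback.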