T*: Progressive Block Scaling for Masked Diffusion Language Models Through Trajectory Aware Reinforcement Learning
This work is an incremental improvement in language modeling efficiency.
The paper tackles the problem of scaling masked diffusion language models to larger block sizes for higher-parallelism decoding, and reports minimal performance degradation on math reasoning benchmarks.
We present T*, a simple TraceRL-based training curriculum for progressive block-size scaling in masked diffusion language models (MDMs). Starting from an AR-initialized small-block MDM, T* transitions smoothly to larger blocks, enabling higher-parallelism decoding with minimal performance degradation on math reasoning benchmarks. Further analysis suggests that T* may converge to an alternative decoding schedule that achieves comparable performance.
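
To make the idea of progressive block-size scaling concrete, the following is a minimal sketch of what a staged block-size schedule could look like, and of why larger blocks imply fewer sequential decoding stages. It is illustrative only: the abstract does not state the block sizes, the growth rule, or the number of stages, so the doubling rule, the example sizes (4 to 32), and the helper names `block_size_schedule` and `num_blocks` are all assumptions. The TraceRL training performed at each stage is not modeled here.

```python
def block_size_schedule(start: int, target: int, factor: int = 2) -> list[int]:
    """Return the block sizes visited when growing from `start` to `target`.

    Hypothetical helper: the geometric growth rule is an assumption,
    not the schedule used by T*.
    """
    if start <= 0 or target < start or factor < 2:
        raise ValueError("require 0 < start <= target and factor >= 2")
    sizes = [start]
    while sizes[-1] < target:
        # Grow by `factor`, but never overshoot the target block size.
        sizes.append(min(sizes[-1] * factor, target))
    return sizes


def num_blocks(seq_len: int, block_size: int) -> int:
    """Number of blocks decoded one after another for a given sequence length.

    Blocks are generated sequentially while tokens inside a block can be
    unmasked in parallel, so fewer blocks means fewer sequential stages.
    """
    if seq_len < 0 or block_size <= 0:
        raise ValueError("require seq_len >= 0 and block_size > 0")
    return -(-seq_len // block_size)  # ceiling division


if __name__ == "__main__":
    # Assumed example values: start at block size 4, finish at 32.
    for size in block_size_schedule(4, 32):
        print(f"block size {size:>2}: {num_blocks(256, size)} sequential blocks")
```

In this sketch each entry of the schedule would correspond to one curriculum stage, with the model from the previous stage used as the starting point for the next; how T* actually performs that transition is described by the paper, not by this example.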