Fast and Accurate Causal Parallel Decoding using Jacobi Forcing
This work addresses the inference latency bottleneck of large language models by enabling faster parallel decoding while maintaining quality, though it builds incrementally on existing multi-token generation approaches.
The paper tackles the problem of accelerating transformer-based large model inference by introducing Jacobi Forcing, a progressive distillation paradigm that trains models on their own parallel decoding trajectories. This shifts autoregressive models into efficient parallel decoders while preserving their causal inference properties. The approach achieves up to 4.5x higher token acceptance per iteration and a 3.8-4.0x wall-clock speedup on coding and math benchmarks with minimal performance loss.
Multi-token generation has emerged as a promising paradigm for accelerating transformer-based large model inference. Recent efforts primarily explore diffusion Large Language Models (dLLMs) for parallel decoding to reduce inference latency. To achieve autoregressive (AR)-level generation quality, many techniques adapt AR models into dLLMs to enable parallel decoding. However, these adapted models offer limited speedup over AR models because of a pretrain-to-posttrain mismatch. Specifically, the masked data distribution in post-training deviates significantly from the real-world data distribution seen during pretraining, and dLLMs rely on bidirectional attention, which conflicts with the causal prior learned during pretraining and hinders the integration of exact KV cache reuse. To address this, we introduce Jacobi Forcing, a progressive distillation paradigm where models are trained on their own generated parallel decoding trajectories, smoothly shifting AR models into efficient parallel decoders while preserving their pretrained causal inference property. Models trained under this paradigm, Jacobi Forcing Models, achieve a 3.8x wall-clock speedup on coding and math benchmarks with minimal loss in performance. Based on Jacobi Forcing Models' trajectory characteristics, we introduce multi-block decoding with rejection recycling, which enables up to 4.5x higher token acceptance count per iteration and nearly 4.0x wall-clock speedup, effectively trading additional compute for lower inference latency. Our code is available at https://github.com/hao-ai-lab/JacobiForcing.
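To make the notion of a "parallel decoding trajectory" concrete, the following is a minimal sketch of plain Jacobi decoding with a causal model, viewed as a fixed-point iteration over a block of draft tokens. It is not the Jacobi Forcing training procedure and not the multi-block decoding with rejection recycling described above. The function `toy_next_token` is a hypothetical deterministic stand-in for greedy next-token prediction, and all names and parameters here are assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

NextToken = Callable[[Sequence[int]], int]


def toy_next_token(prefix: Sequence[int]) -> int:
    """Hypothetical stand-in for a greedy AR model: maps a prefix to one token id."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def ar_decode(next_token: NextToken, prompt: Sequence[int], n: int) -> List[int]:
    """Standard autoregressive greedy decoding: one token per step."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(next_token(seq))
    return seq[len(prompt):]


def jacobi_decode_block(
    next_token: NextToken, prompt: Sequence[int], n: int, init_token: int = 0
) -> Tuple[List[int], List[List[int]]]:
    """Decode a block of n tokens by Jacobi (fixed-point) iteration.

    Every position is refreshed from the PREVIOUS iterate, so with a real
    causal model all n updates would come from a single forward pass.
    Returns the converged block and the full trajectory of iterates.
    """
    block = [init_token] * n          # arbitrary initial guess
    trajectory = [list(block)]
    while True:
        new_block = [next_token(list(prompt) + block[:i]) for i in range(n)]
        trajectory.append(list(new_block))
        if new_block == block:        # fixed point reached
            return new_block, trajectory
        block = new_block


prompt = [3, 1, 4]
n = 8
reference = ar_decode(toy_next_token, prompt, n)
block, trajectory = jacobi_decode_block(toy_next_token, prompt, n)
iterations = len(trajectory) - 1
print("iterations:", iterations, "matches AR:", block == reference)
```

Under causal attention, each iteration fixes at least one more leading token, so the loop stops after at most n + 1 iterations and its fixed point equals the greedy AR output. An untrained model typically gains little beyond that lower bound, which is the gap that training on such trajectories aims to close by increasing the number of tokens accepted per iteration.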