Breaking the Bottleneck with DiffuApriel: High-Throughput Diffusion LMs with Mamba Backbone
This work addresses slow text generation in diffusion language models, offering their users a practical and scalable improvement.
The paper tackles the inference-efficiency bottleneck in diffusion-based language models by introducing DiffuApriel, which uses a bidirectional Mamba backbone to achieve up to 4.4x higher throughput on long sequences while matching the performance of Transformer-based diffusion models.
Diffusion-based language models have recently emerged as a promising alternative to autoregressive generation, yet their reliance on Transformer backbones limits inference efficiency due to quadratic attention and KV-cache overhead. In this work, we introduce DiffuApriel, a masked diffusion language model built on a bidirectional Mamba backbone that combines the diffusion objective with linear-time sequence modeling. DiffuApriel matches the performance of Transformer-based diffusion models while achieving up to 4.4x higher inference throughput for long sequences with a 1.3B model. We further propose DiffuApriel-H, a hybrid variant that interleaves attention and Mamba layers, offering up to 2.6x throughput improvement with balanced global and local context modeling. Our results demonstrate that bidirectional state-space architectures serve as strong denoisers in masked diffusion LMs, providing a practical and scalable foundation for faster, memory-efficient text generation.
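To make the idea of bidirectional, linear-time sequence mixing concrete, here is a toy sketch in Python. It is not the DiffuApriel or Mamba implementation: it assumes a fixed scalar decay in place of Mamba's input-dependent (selective) parameters, and it assumes the two directions are combined by summation, which the abstract does not specify. The names `linear_scan` and `bidirectional_scan` are hypothetical.

```python
import numpy as np

def linear_scan(x, decay):
    """Causal recurrence h[t] = decay * h[t-1] + x[t].

    One pass over the sequence, so cost grows linearly with length L
    (unlike attention, which compares all L x L position pairs).
    """
    h = np.zeros_like(x, dtype=float)
    state = np.zeros(x.shape[1], dtype=float)
    for t in range(x.shape[0]):
        state = decay * state + x[t]
        h[t] = state
    return h

def bidirectional_scan(x, decay=0.5):
    """Run the scan left-to-right and right-to-left, then sum (an assumption).

    Every position then carries information from both earlier and later
    tokens, which a denoiser needs when masked tokens can appear anywhere.
    """
    forward = linear_scan(x, decay)
    backward = linear_scan(x[::-1], decay)[::-1]
    return forward + backward

# A sequence of length 3 with a single feature; only the LAST token is non-zero.
x = np.array([[0.0], [0.0], [1.0]])
causal = linear_scan(x, 0.5)
both = bidirectional_scan(x, 0.5)
```

In this example the causal scan leaves position 0 at zero because it cannot see the later token, whereas the bidirectional scan gives position 0 a non-zero value propagated from the end of the sequence.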