On the Quantization Robustness of Diffusion Language Models in Coding Benchmarks
This work addresses the high memory and inference costs of deploying language models for coding tasks and offers insights for efficient deployment. Its contribution is incremental, since it applies existing quantization techniques to a newer model type.
The paper investigates the robustness of diffusion-based language models (d-LLMs) to post-training quantization (PTQ) in coding tasks. It finds that CoDA, a diffusion-based model, is more resilient at low bitwidths (2-4 bits) than its auto-regressive counterpart, with smaller accuracy degradation on the HumanEval and MBPP benchmarks.
Auto-regressive Large Language Models (LLMs) achieve strong performance on coding tasks but incur high memory and inference costs. Diffusion-based language models (d-LLMs) offer bounded inference cost via iterative denoising, but their behavior under post-training quantization (PTQ) has been sparsely explored. We investigate the application and robustness of PTQ techniques, specifically GPTQ and a modified Hessian-Aware Quantization (HAWQ) algorithm, on a diffusion-based coding LLM (CoDA) and compare it with Qwen3-1.7B, its auto-regressive counterpart, under a standardized evaluation pipeline. We find that in our setup, CoDA exhibits greater robustness at low bitwidths (2-4 bits), with smaller accuracy degradation across the HumanEval and MBPP benchmarks. Additionally, mixed-precision configurations derived from HAWQ provide smooth trade-offs across accuracy, latency, and memory. The results suggest that diffusion LLMs may offer advantages for efficient deployment because they are more resilient to quantization.
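To give a sense of why the 2-4 bit regime is the demanding one, the following minimal sketch quantizes a synthetic weight vector with plain symmetric round-to-nearest quantization and reports the reconstruction error per bitwidth. It is an illustration only: it is neither GPTQ nor the modified HAWQ procedure studied here, and the function names (`quantize_rtn`, `mse`), the Gaussian weights, and the per-tensor scale are all assumptions made for the example.

```python
import random


def quantize_rtn(weights, bits):
    """Symmetric per-tensor round-to-nearest quantization (assumed baseline).

    Returns the dequantized values so they can be compared with the originals.
    """
    qmax = 2 ** (bits - 1) - 1  # largest integer level, e.g. 7 for 4 bits, 1 for 2 bits
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)  # nothing to quantize
    scale = max_abs / qmax  # one scale shared by the whole tensor
    dequantized = []
    for w in weights:
        q = round(w / scale)            # map to the nearest integer level
        q = max(-qmax, min(qmax, q))    # clamp to the representable range
        dequantized.append(q * scale)   # map back to floating point
    return dequantized


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Synthetic stand-in for one layer's weights (hypothetical, not from any model).
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(4096)]

errors = {bits: mse(weights, quantize_rtn(weights, bits)) for bits in (8, 4, 3, 2)}
for bits, err in errors.items():
    print(f"{bits}-bit MSE: {err:.3e}")
```

In this toy setting the error grows sharply as the bitwidth drops, because at 2 bits only three levels remain. Methods such as GPTQ and HAWQ aim to reduce or redistribute this error, which is why low-bitwidth accuracy is a useful stress test when comparing model families.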