AR, LG | Apr 11, 2025

MixDiT: Accelerating Image Diffusion Transformer Inference with Mixed-Precision MX Quantization

arXiv:2504.08398v1 | 4 citations | h-index: 3 | IEEE Computer Architecture Letters
Originality: Incremental advance
AI Analysis

This work addresses slow DiT inference in image generation, but the advance is incremental because it builds on prior quantization work.

The paper tackles the high computational cost and latency of Diffusion Transformer (DiT) inference by proposing MixDiT, a mixed-precision quantization method that achieves speedups of 2.10-5.32 times over an RTX 3090 with no loss in FID.

Diffusion Transformer (DiT) has driven significant progress in image generation tasks. However, DiT inference is notoriously compute-intensive and incurs long latency even on datacenter-scale GPUs, primarily due to its iterative nature and heavy reliance on GEMM operations inherent to its encoder-based structure. To address this challenge, prior work has explored quantization, but achieving low-precision quantization for DiT inference with both high accuracy and substantial speedup remains an open problem. To this end, this paper proposes MixDiT, an algorithm-hardware co-designed acceleration solution that exploits mixed Microscaling (MX) formats to quantize DiT activation values. MixDiT quantizes the DiT activation tensors by selectively applying higher precision to magnitude-based outliers, which produces mixed-precision GEMM operations. To achieve tangible speedup from the mixed-precision arithmetic, we design a MixDiT accelerator that enables precision-flexible multiplications and efficient MX precision conversions. Our experimental results show that MixDiT delivers a speedup of 2.10-5.32 times over RTX 3090, with no loss in FID.
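
The idea of giving magnitude-based outliers higher precision can be illustrated with a small sketch. This is not the paper's algorithm: the abstract does not state the block size, the element formats, or the outlier criterion, so everything below is an assumption chosen for illustration. The sketch uses signed integers with a shared power-of-two scale per block (a simplification in the spirit of MX block scaling), and the hypothetical parameters `block_size`, `low_bits`, `high_bits`, and `outlier_fraction` select which blocks get the wider format.

```python
import math


def quantize_block(block, bits):
    """Quantize one block to signed `bits`-bit integers with a shared
    power-of-two scale, then return the dequantized values.
    (Illustrative simplification, not an exact MX element format.)"""
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return [0.0 for _ in block]
    qmax = 2 ** (bits - 1) - 1
    # Smallest power-of-two scale that keeps the largest element in range.
    scale = 2.0 ** math.ceil(math.log2(max_abs / qmax))
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in block]


def mixed_precision_quantize(values, block_size=4, low_bits=4,
                             high_bits=8, outlier_fraction=0.25):
    """Quantize `values` block by block. Blocks whose peak magnitude ranks
    in the top `outlier_fraction` get `high_bits`; the rest get `low_bits`.
    (The ranking rule is an assumed stand-in for the paper's criterion.)"""
    blocks = [values[i:i + block_size]
              for i in range(0, len(values), block_size)]
    peaks = [max(abs(v) for v in b) for b in blocks]
    n_high = math.ceil(outlier_fraction * len(blocks))
    ranked = sorted(range(len(blocks)), key=lambda i: peaks[i], reverse=True)
    high_blocks = set(ranked[:n_high])

    dequantized, bits_used = [], []
    for i, block in enumerate(blocks):
        bits = high_bits if i in high_blocks else low_bits
        dequantized.extend(quantize_block(block, bits))
        bits_used.append(bits)
    return dequantized, bits_used
```

Calling `mixed_precision_quantize(activations)` on a flat list of activation values returns the dequantized values together with the bit width chosen for each block. In this sketch a single large element forces a coarse shared scale on its block, so spending more bits there preserves the small neighbours that a low-bit format would round to zero.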

