DiffPro: Joint Timestep and Layer-Wise Precision Optimization for Efficient Diffusion Inference
DiffPro addresses the inference-efficiency problem that limits deployment of diffusion models in real-time applications, offering incremental improvements through hardware-aware optimization.
The paper tackles the high computational cost of diffusion model inference by proposing DiffPro, a post-training framework that jointly optimizes timesteps and per-layer precision, achieving up to 6.25x model compression, 50% fewer timesteps, and 2.8x faster inference while keeping the FID increase (Delta FID) at or below 10.
Diffusion models produce high-quality images, but inference is costly because of the many denoising steps and heavy matrix operations involved. We present DiffPro, a post-training, hardware-faithful framework that works with the exact integer kernels used in deployment and jointly tunes timesteps and per-layer precision in Diffusion Transformers (DiTs) to reduce latency and memory without any training. DiffPro combines three parts: a manifold-aware sensitivity metric to allocate weight bits, dynamic activation quantization to stabilize activations across timesteps, and a budgeted timestep selector guided by teacher-student drift. In experiments, DiffPro achieves up to 6.25x model compression, 50% fewer timesteps, and 2.8x faster inference with Delta FID <= 10 on standard benchmarks, demonstrating practical efficiency gains. DiffPro unifies step reduction and precision planning into a single, budgeted, deployable plan for real-time, energy-aware diffusion inference.
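To make two of the components named in the abstract more concrete, the following is a minimal sketch in Python with NumPy. It is not the authors' implementation. The abstract does not specify the quantization granularity, the sensitivity metric's formula, or the allocation procedure, so everything here is an assumption chosen for illustration: the function names dynamic_quantize, dequantize, and allocate_bits are hypothetical, the activation scale is computed per row (for example per token) at call time, the sensitivity scores are taken as given inputs, and the bit allocation is a simple greedy pass under an average-bit budget.

```python
import numpy as np

def dynamic_quantize(x, num_bits=8):
    """Symmetric quantization with scales computed from the live tensor.

    Assumption: one scale per row (last axis), recomputed on every call,
    so activations whose range shifts between timesteps get a fresh scale.
    """
    qmax = 2 ** (num_bits - 1) - 1
    absmax = np.max(np.abs(x), axis=-1, keepdims=True)
    # Avoid division by zero for all-zero rows.
    scale = np.where(absmax > 0, absmax / qmax, 1.0)
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return q, scale

def dequantize(q, scale):
    """Map integer codes back to floating point."""
    return q.astype(np.float64) * scale

def allocate_bits(sensitivity, sizes, avg_bit_budget, choices=(4, 6, 8)):
    """Greedy per-layer weight bit allocation (illustrative only).

    All layers start at the lowest precision. Layers are then visited from
    most to least sensitive, and each receives the highest precision that
    still fits within the total bit budget.
    """
    lowest = min(choices)
    bits = [lowest] * len(sizes)
    budget = avg_bit_budget * sum(sizes)   # total weight bits allowed
    used = lowest * sum(sizes)             # bits spent so far
    order = sorted(range(len(sizes)), key=lambda i: -sensitivity[i])
    for i in order:
        for b in sorted(choices, reverse=True):
            extra = (b - bits[i]) * sizes[i]
            if extra > 0 and used + extra <= budget:
                bits[i] = b
                used += extra
                break
    return bits

# Two "timesteps" with very different activation ranges get different scales.
acts = np.array([[0.0, 0.25, -1.0], [10.0, -4.0, 2.0]])
q, scale = dynamic_quantize(acts)
recon = dequantize(q, scale)

# Hypothetical sensitivity scores and parameter counts for three layers.
plan = allocate_bits(sensitivity=[0.9, 0.1, 0.5],
                     sizes=[100, 100, 100],
                     avg_bit_budget=6)
```

In this toy run the most sensitive layer is upgraded first and the least sensitive layer stays at the lowest precision, which is the qualitative behavior a sensitivity-guided allocator is meant to have. A real system would replace the placeholder scores with the paper's manifold-aware metric and would evaluate the plan against the deployed integer kernels.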