CV, May 18

Stabilizing, Scaling & Enhancing MeanFlow for Large-scale Diffusion Distillation

Xiao He, Yang Li, Peizhen Zhang, Songtao Liu, Zhao Zhong, Nannan Wang

arXiv:2605.17834

AI Analysis

Enables stable and high-quality few-step distillation for very large text-to-image models, addressing a key bottleneck in practical deployment.

MeanFlow suffers from instability and mean-seeking bias when distilling large-scale diffusion models. The authors propose a warm-up technique and trajectory distribution alignment, achieving superior performance on FLUX.1-dev (12B params) and robust generalization on HunyuanImage 3.0 (80B params).

Diffusion models exhibit remarkable generative capability, but their high latency limits practical deployment. Many studies have attempted to accelerate inference by reducing the number of sampling steps. Among them, MeanFlow has attracted considerable attention for its concise formulation and strong performance. Nevertheless, the instability of its optimization objective and its "mean-seeking bias" have limited its applicability to distilling large-scale industrial models. To stabilize MeanFlow for distilling large-scale models, we first introduce a warm-up technique in which the original differential solution of MeanFlow is replaced by a discrete solution. This design avoids the training collapse that occurs when the MeanFlow target contains a stop-gradient term computed by an undertrained model. Once the model acquires a preliminary ability to fit the average velocity field, we switch the optimization objective back to the differential solution, enabling further refinement. Meanwhile, to alleviate the "mean-seeking bias" of MeanFlow under extremely few-step inference with complex target distributions, we incorporate trajectory distribution alignment as an auxiliary objective, encouraging the student model's trajectory distribution to align more closely with that of the teacher model. Our proposed distillation framework outperforms existing distillation approaches when applied to the text-to-image (T2I) model FLUX.1-dev (up to 12B parameters). Furthermore, when extended to the 80B-parameter state-of-the-art (SOTA) T2I model HunyuanImage 3.0, our method continues to demonstrate robust generalization and strong performance.
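The contrast between a differential target and a discrete target can be illustrated on a toy problem. The sketch below is not the authors' implementation, and the abstract does not specify their discrete solution or their switching criterion. It assumes the MeanFlow identity from the original MeanFlow formulation, u(z_t, r, t) = v(z_t, t) - (t - r) d/dt u(z_t, r, t), and, as a hypothetical discrete surrogate, splits the interval [r, t] into [r, t - dt] and [t - dt, t]. The toy velocity field `v(z, t) = A * z`, the names `differential_target`, `discrete_target`, and `meanflow_target`, and the step-count switch `warmup_steps` are all assumptions made for illustration. In real training, `u` would be the student network evaluated under stop-gradient; here the exact average velocity is used so that both targets can be checked against a known answer.

```python
import math

A = 0.7  # hypothetical toy constant: velocity field v(z, t) = A * z


def velocity(z, t):
    """Instantaneous velocity of the toy ODE dz/dt = A * z."""
    return A * z


def avg_velocity(z_t, r, t):
    """Exact average velocity over [r, t] for the toy ODE (closed form)."""
    if t == r:
        return velocity(z_t, t)
    z_r = z_t * math.exp(A * (r - t))  # state at time r on the same trajectory
    return (z_t - z_r) / (t - r)


def differential_target(u, z_t, r, t, eps=1e-5):
    """Target from the MeanFlow identity, u = v - (t - r) * du/dt.

    The total derivative du/dt = v * du/dz + du/dt(partial) is estimated with
    central finite differences here; a real implementation would typically
    use a Jacobian-vector product instead.
    """
    v = velocity(z_t, t)
    du_dz = (u(z_t + eps, r, t) - u(z_t - eps, r, t)) / (2 * eps)
    du_dt = (u(z_t, r, t + eps) - u(z_t, r, t - eps)) / (2 * eps)
    return v - (t - r) * (v * du_dz + du_dt)


def discrete_target(u, z_t, r, t, dt=1e-3):
    """Assumed discrete surrogate: no derivative of u is needed.

    Displacement over [r, t] = displacement over [t - dt, t] (one Euler step)
    plus displacement over [r, t - dt] (average velocity from the earlier state).
    """
    v = velocity(z_t, t)
    z_prev = z_t - dt * v  # Euler step back to time t - dt
    return (dt * v + (t - dt - r) * u(z_prev, r, t - dt)) / (t - r)


def meanflow_target(u, z_t, r, t, step, warmup_steps=1000):
    """Hypothetical schedule: discrete target during warm-up, differential after."""
    if step < warmup_steps:
        return discrete_target(u, z_t, r, t)
    return differential_target(u, z_t, r, t)
```

When `u` is the exact average velocity, both targets agree with it up to discretization error, which is why the discrete form can serve as a stand-in early in training. The practical difference is that the discrete form only evaluates `u`, while the differential form differentiates it, so errors in an undertrained `u` enter the differential target through its derivatives as well as its values.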
