CV · LG · Dec 30, 2025

Guiding a Diffusion Transformer with the Internal Dynamics of Itself

arXiv:2512.24176v1 · 2 citations · h-index: 6
Originality: Highly original
AI Analysis

This work addresses image generation quality for AI applications, offering a novel method that enhances training efficiency and outperforms existing approaches, though it is incremental in the context of diffusion model guidance.

The paper tackles the problem of diffusion models generating low-quality images in low-probability areas by proposing Internal Guidance (IG), a strategy that uses auxiliary supervision on an intermediate layer during training and extrapolation between intermediate and deep layer outputs during sampling. Reported results include FID=1.34 on ImageNet 256x256 with LightningDiT-XL/1+IG, and FID=1.19 when combined with CFG.

Diffusion models are powerful at capturing the entire (conditional) data distribution. However, without sufficient training and data to cover low-probability areas, the model is penalized for failing to generate high-quality images in those areas. To improve generation quality, guidance strategies such as classifier-free guidance (CFG) steer samples toward high-probability areas during sampling. However, standard CFG often leads to over-simplified or distorted samples. The alternative line of work, guiding a diffusion model with a bad version of itself, is limited by its reliance on carefully designed degradation strategies, extra training, and additional sampling steps. In this paper, we propose a simple yet effective strategy, Internal Guidance (IG), which adds auxiliary supervision on an intermediate layer during training and extrapolates the outputs of the intermediate and deep layers to obtain generative results during sampling. This simple strategy yields significant improvements in both training efficiency and generation quality on various baselines. On ImageNet 256x256, SiT-XL/2+IG achieves FID=5.31 and FID=1.75 at 80 and 800 epochs, respectively. More impressively, LightningDiT-XL/1+IG achieves FID=1.34, outperforming all of these methods by a large margin. Combined with CFG, LightningDiT-XL/1+IG achieves the current state-of-the-art FID of 1.19.
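The abstract does not spell out the IG formulas, so the following is only a minimal sketch of how the two stages might look, under stated assumptions. It assumes the intermediate and deep layers each produce a prediction of the same shape, that training sums the two regression losses with a hypothetical weight `aux_weight`, and that sampling extrapolates in the same linear form that CFG uses, with a hypothetical scale `w`. The function names and both parameters are illustrative and do not come from the paper.

```python
def mse(pred, target):
    """Mean squared error between two equal-length sequences of floats."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def training_loss(out_deep, out_mid, target, aux_weight=0.5):
    """Main loss on the deep output plus an auxiliary loss on the intermediate output.

    Assumption: both outputs regress the same target; aux_weight is hypothetical.
    """
    return mse(out_deep, target) + aux_weight * mse(out_mid, target)


def internal_guidance(out_deep, out_mid, w):
    """Extrapolate from the intermediate output through the deep output.

    Assumption: CFG-style linear form. w = 1 returns out_deep unchanged;
    w > 1 moves past out_deep, away from the weaker intermediate prediction.
    """
    return [m + w * (d - m) for d, m in zip(out_deep, out_mid)]


# Toy usage: the deep prediction is closer to the target than the intermediate one.
target = [1.0, 0.0]
out_mid = [0.0, 1.0]
out_deep = [0.5, 0.5]

loss = training_loss(out_deep, out_mid, target)
guided = internal_guidance(out_deep, out_mid, w=2.0)
```

In this toy case the intermediate output acts as the "bad version" of the model, and a scale of `w=2.0` moves the deep prediction further along the direction in which it already improves on the intermediate one. Under this assumed form, no separate degraded model or extra forward pass is needed, since both outputs come from a single network evaluation.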

