CV · AI · LG · May 6

Taming Outlier Tokens in Diffusion Transformers

arXiv:2605.05206 · 86.5
Predicted impact: top 20% in CV · last 90 days
Originality: Incremental advance
AI Analysis

For researchers working on diffusion-based image generation, this work addresses an underexplored artifact that limits DiT performance, offering a practical fix.

The paper identifies outlier tokens in Diffusion Transformers (DiTs) that degrade image generation quality, and proposes Dual-Stage Registers (DSR) to mitigate them, achieving consistent improvements across ImageNet and text-to-image tasks.

We study outlier tokens in Diffusion Transformers (DiTs) for image generation. Prior work has shown that Vision Transformers (ViTs) can produce a small number of high-norm tokens that attract disproportionate attention while carrying limited local information, but their role in generative models remains underexplored. We show that this phenomenon appears in both the encoder and denoiser of modern Representation Autoencoder (RAE)-DiT pipelines: pretrained ViT encoders can produce outlier representations, and DiTs themselves can develop internal outlier tokens, especially in intermediate layers. Moreover, simply masking high-norm tokens does not improve performance, indicating that the problem is not only caused by a few extreme values, but is more closely related to corrupted local patch semantics. To address this issue, we introduce Dual-Stage Registers (DSR), a register-based intervention for both components: trained registers when available, recursive test-time registers otherwise, and diffusion registers for the denoiser. Across ImageNet and large-scale text-to-image generation, these interventions consistently reduce outlier artifacts and improve generation quality. Our results highlight outlier-token control as an important ingredient in building stronger DiTs.
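As a rough illustration of the two ideas in the abstract, the sketch below flags high-norm tokens and shows the basic register pattern of appending extra tokens to a sequence and discarding them afterwards. It is not the paper's method: the abstract does not state a detection criterion or the DSR details, so the median-plus-MAD rule, the threshold `k`, and the function names `find_outlier_tokens`, `add_registers`, and `strip_registers` are all assumptions made for illustration.

```python
import numpy as np


def find_outlier_tokens(tokens, k=5.0):
    """Flag tokens whose L2 norm exceeds median + k * MAD.

    tokens: array of shape (num_tokens, dim).
    The median/MAD rule and k are assumptions, not the paper's criterion.
    """
    norms = np.linalg.norm(tokens, axis=-1)
    median = np.median(norms)
    mad = np.median(np.abs(norms - median))  # median absolute deviation
    return norms > median + k * mad


def add_registers(tokens, registers):
    """Append register tokens after the patch tokens (hypothetical helper)."""
    return np.concatenate([tokens, registers], axis=0)


def strip_registers(tokens, num_registers):
    """Drop the trailing register tokens so only patch tokens remain."""
    if num_registers == 0:
        return tokens
    return tokens[:-num_registers]


# Toy data: 16 tokens of dimension 8 with smoothly varying norms.
scales = np.linspace(0.5, 1.5, 16)[:, None]
patch_tokens = np.tile(scales, (1, 8))
patch_tokens[3] *= 50.0  # inject one artificial high-norm token

outlier_mask = find_outlier_tokens(patch_tokens)
outlier_indices = np.flatnonzero(outlier_mask)

# Register pattern: extend the sequence, then discard the extra tokens.
registers = np.zeros((4, 8))  # placeholder values; real registers are learned or computed
extended = add_registers(patch_tokens, registers)
restored = strip_registers(extended, num_registers=4)
```

In a real pipeline the extended sequence would pass through the transformer blocks before the registers are stripped; that step is omitted here. Note also that, per the abstract, simply masking the flagged tokens does not improve performance, so detection like this is a diagnostic rather than a fix.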
