CV | Dec 1, 2025

ResDiT: Evoking the Intrinsic Resolution Scalability in Diffusion Transformers

arXiv:2512.01426v1 | h-index: 7
Originality: Incremental advance
AI Analysis

This work addresses the challenge of efficiently generating high-resolution images for AI art and content creation. It is an incremental advance because it builds on existing pre-trained Diffusion Transformers.

The paper tackles the spatial layout collapse and degraded texture fidelity that arise when pre-trained Diffusion Transformers are scaled to high-resolution image synthesis. The result is ResDiT, a training-free method that rectifies positional encodings and enhances local details, achieving high-fidelity synthesis without complex pipelines.

Leveraging pre-trained Diffusion Transformers (DiTs) for high-resolution (HR) image synthesis often leads to spatial layout collapse and degraded texture fidelity. Prior work mitigates these issues with complex pipelines that first perform a base-resolution (i.e., training-resolution) denoising process to guide HR generation. We instead explore the intrinsic generative mechanisms of DiTs and propose ResDiT, a training-free method that scales resolution efficiently. We identify the core factor governing spatial layout, position embeddings (PEs), and show that the original PEs encode incorrect positional information when extrapolated to HR, which triggers layout collapse. To address this, we introduce a PE scaling technique that rectifies positional encoding under resolution changes. To further remedy low-fidelity details, we develop a local-enhancement mechanism grounded in base-resolution local attention. We design a patch-level fusion module that aggregates global and local cues, together with a Gaussian-weighted splicing strategy that eliminates grid artifacts. Comprehensive evaluations demonstrate that ResDiT consistently delivers high-fidelity, high-resolution image synthesis and integrates seamlessly with downstream tasks, including spatially controlled generation.
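
The abstract names two mechanisms, PE scaling and Gaussian-weighted splicing, but does not give their formulas. The sketch below is only an illustration under stated assumptions, not the authors' implementation: it uses plain linear position interpolation as a stand-in for PE scaling (high-resolution token indices are mapped back into the index range seen at training resolution), and a separable Gaussian window to blend overlapping patches so that seams between them are smoothed. All function names and the `sigma_frac` parameter are hypothetical.

```python
import numpy as np


def scaled_positions(hr_len: int, base_len: int) -> np.ndarray:
    """Map hr_len token indices into [0, base_len - 1].

    Assumption: linear interpolation of positions, used here as a
    stand-in for the paper's PE scaling, whose exact form is not
    given in the abstract.
    """
    if hr_len <= 1:
        return np.zeros(hr_len)
    return np.arange(hr_len) * (base_len - 1) / (hr_len - 1)


def gaussian_window(size: int, sigma_frac: float = 0.25) -> np.ndarray:
    """2D Gaussian weight map that peaks at the patch centre.

    sigma_frac is a hypothetical knob: sigma = sigma_frac * size.
    """
    coords = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-0.5 * (coords / (sigma_frac * size)) ** 2)
    return np.outer(g, g)


def gaussian_splice(patches, origins, out_shape, size):
    """Blend overlapping square patches into one 2D canvas.

    Each patch contributes with Gaussian weights, so pixels near a
    patch border are dominated by neighbouring patches whose centres
    are closer. The accumulated weights normalise the result.
    """
    canvas = np.zeros(out_shape)
    weight = np.zeros(out_shape)
    w = gaussian_window(size)
    for patch, (y, x) in zip(patches, origins):
        canvas[y:y + size, x:x + size] += patch * w
        weight[y:y + size, x:x + size] += w
    # Guard against division by zero where no patch covers a pixel.
    return canvas / np.maximum(weight, 1e-8)
```

For example, `scaled_positions(5, 3)` yields `[0.0, 0.5, 1.0, 1.5, 2.0]`: five high-resolution positions are squeezed into the three-position range a model trained at the base resolution would have seen. In `gaussian_splice`, two 8x8 patches placed at origins `(0, 0)` and `(0, 4)` on an 8x12 canvas overlap in columns 4 to 7, where the output is a weighted average of both patches.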
