LG · AI · Oct 29, 2025

ScaleDiff: Higher-Resolution Image Synthesis via Efficient and Model-Agnostic Diffusion

arXiv:2510.25818v1 · 3 citations · h-index: 2
Originality: Incremental advance
AI Analysis

This work addresses a practical limitation for users of diffusion models by enabling efficient higher-resolution image synthesis. It is an incremental advance, however, because it builds on existing training-free methods.

The paper tackles the degraded performance of text-to-image diffusion models when they generate images beyond their training resolution. It proposes ScaleDiff, a model-agnostic framework that requires no additional training and reports state-of-the-art image quality and inference speed among training-free methods.

Text-to-image diffusion models often exhibit degraded performance when generating images beyond their training resolution. Recent training-free methods can mitigate this limitation, but they often require substantial computation or are incompatible with recent Diffusion Transformer models. In this paper, we propose ScaleDiff, a model-agnostic and highly efficient framework for extending the resolution of pretrained diffusion models without any additional training. A core component of our framework is Neighborhood Patch Attention (NPA), an efficient mechanism that reduces computational redundancy in the self-attention layer with non-overlapping patches. We integrate NPA into an SDEdit pipeline and introduce Latent Frequency Mixing (LFM) to better generate fine details. Furthermore, we apply Structure Guidance to enhance global structure during the denoising process. Experimental results demonstrate that ScaleDiff achieves state-of-the-art performance among training-free methods in terms of both image quality and inference speed on both U-Net and Diffusion Transformer architectures.
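The efficiency claim for non-overlapping patches can be made concrete with a rough count of query-key pairs in one self-attention layer. The sketch below is illustrative only and is not the paper's implementation: the function names, the latent size, and the window, stride, patch, and neighborhood values are all assumptions chosen for round numbers. It compares full attention over the whole latent, a sliding overlapping-window scheme, and a scheme in which non-overlapping query patches each attend to a larger key neighborhood.

```python
def full_attention_pairs(h: int, w: int) -> int:
    """Query-key pairs when every token attends to every token."""
    n = h * w
    return n * n


def overlapping_window_pairs(h: int, w: int, win: int, stride: int) -> int:
    """Pairs when overlapping win x win windows are each processed with
    full attention. Assumes (h - win) and (w - win) are multiples of stride."""
    windows_y = (h - win) // stride + 1
    windows_x = (w - win) // stride + 1
    tokens_per_window = win * win
    return windows_y * windows_x * tokens_per_window ** 2


def non_overlapping_patch_pairs(h: int, w: int, patch: int, neighborhood: int) -> int:
    """Pairs when queries are split into non-overlapping patch x patch tiles
    and each tile attends to a neighborhood x neighborhood key window.
    Assumes h and w are multiples of patch (hypothetical setup)."""
    tiles = (h // patch) * (w // patch)
    return tiles * (patch * patch) * (neighborhood * neighborhood)


if __name__ == "__main__":
    # Hypothetical 256 x 256 latent grid; all sizes below are assumptions.
    h = w = 256
    print("full attention:        ", full_attention_pairs(h, w))
    print("overlapping windows:   ", overlapping_window_pairs(h, w, win=128, stride=64))
    print("non-overlapping patches:", non_overlapping_patch_pairs(h, w, patch=64, neighborhood=128))
```

Under these assumed sizes, each query token is computed exactly once in the non-overlapping scheme, whereas overlapping windows recompute tokens that fall in several windows. That recomputation is the kind of redundancy the abstract refers to, although the actual savings depend on the paper's real configuration.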

