CV | Jan 23, 2025

MSF: Efficient Diffusion Model Via Multi-Scale Latent Factorize

arXiv:2501.13349v2 | 3 citations | h-index: 9
Originality: Highly original
AI Analysis

This work addresses computational bottlenecks in diffusion models for high-resolution image generation, offering a more efficient alternative for applications like content creation, though it is incremental as it builds on existing hierarchical decomposition ideas.

The paper tackles the computational inefficiency of diffusion models for high-resolution image generation by introducing a multi-scale latent factorization framework that decomposes denoising into base and residual signals, achieving FID scores of 2.08 at 256x256 and 2.47 at 512x512 on ImageNet with a 4x speed-up compared to DiT.

While diffusion-based generative models have made significant strides in visual content creation, conventional approaches face computational challenges, especially for high-resolution images, as they denoise the entire image from noisy inputs. This contrasts with signal processing techniques, such as Fourier and wavelet analyses, which often employ hierarchical decompositions. Inspired by such principles, particularly the idea of signal separation, we introduce a diffusion framework leveraging multi-scale latent factorization. Our framework uniquely decomposes the denoising target, typically latent features from a pretrained Variational Autoencoder, into a low-frequency base signal capturing core structural information and a high-frequency residual signal that contributes finer, high-frequency details like textures. This decomposition into base and residual components directly informs our two-stage image generation process, which first produces the low-resolution base, followed by the generation of the high-resolution residual. Our proposed architecture facilitates reduced sampling steps during the residual learning stage, owing to the inherent ease of modeling residual information, which confers advantages over conventional full-resolution generation techniques. This specific approach of decomposing the signal into a base and a residual, conceptually akin to how wavelet analysis can separate different frequency bands, yields a more streamlined and intuitive design distinct from generic hierarchical models. Our method, MSF (Multi-Scale Factorization), demonstrates its effectiveness by achieving FID scores of 2.08 ($256\times256$) and 2.47 ($512\times512$) on class-conditional ImageNet benchmarks, outperforming the DiT baseline (2.27 and 3.04 respectively) while also delivering a $4\times$ speed-up with the same number of sampling steps.
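
The base/residual split described in the abstract can be illustrated with a minimal sketch. The abstract does not state which downsampling and upsampling operators the authors use, so the average pooling and nearest-neighbour upsampling below are assumptions chosen for simplicity, and the function names (`downsample`, `upsample`, `factorize`, `reconstruct`) are hypothetical. The random array stands in for a VAE latent; no diffusion model is involved here, only the decomposition and its inverse.

```python
import numpy as np


def downsample(z, factor=2):
    """Average-pool a (C, H, W) array by `factor` (assumed operator)."""
    c, h, w = z.shape
    assert h % factor == 0 and w % factor == 0
    blocks = z.reshape(c, h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(2, 4))


def upsample(base, factor=2):
    """Nearest-neighbour upsampling of a (C, H, W) array (assumed operator)."""
    return base.repeat(factor, axis=1).repeat(factor, axis=2)


def factorize(z, factor=2):
    """Split a latent into a low-resolution base and a full-resolution residual."""
    base = downsample(z, factor)
    residual = z - upsample(base, factor)
    return base, residual


def reconstruct(base, residual, factor=2):
    """Invert the split: upsampled base plus residual gives the latent back."""
    return upsample(base, factor) + residual


# Stand-in for a VAE latent with 4 channels at 32x32 (illustrative shape).
rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 32, 32))

base, residual = factorize(latent)
recon = reconstruct(base, residual)
```

With these assumed operators the residual averages to zero over every pooling block, so all coarse structure lives in the base and the residual carries only local detail. In a two-stage generator of the kind the abstract describes, the first stage would produce `base` and the second would produce `residual` conditioned on it.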

