CV · Nov 16, 2025

Denoising Vision Transformer Autoencoder with Spectral Self-Regularization

arXiv:2511.12633v1 · 4 citations
Originality: Incremental advance
AI Analysis

This work addresses a bottleneck in generative modeling for computer vision by improving training efficiency and generation quality. It is incremental, however, because it builds on existing VAE and diffusion model frameworks.

The paper examines why high-dimensional latent spaces in variational autoencoders (VAEs) hinder generative model training. It finds that redundant high-frequency components in these latents degrade convergence and generation quality, and it proposes a spectral self-regularization strategy to suppress them. With this strategy, diffusion models converge about 2× faster than with SD-VAE while achieving state-of-the-art reconstruction (rFID = 0.28, PSNR = 27.26) and competitive generation (gFID = 1.82) on ImageNet 256×256.

Variational autoencoders (VAEs) typically encode images into a compact latent space, reducing computational cost but introducing an optimization dilemma: a higher-dimensional latent space improves reconstruction fidelity but often hampers generative performance. Recent methods attempt to address this dilemma by regularizing high-dimensional latent spaces using external vision foundation models (VFMs). However, it remains unclear how high-dimensional VAE latents affect the optimization of generative models. To our knowledge, our analysis is the first to reveal that redundant high-frequency components in high-dimensional latent spaces hinder the training convergence of diffusion models and, consequently, degrade generation quality. To alleviate this problem, we propose a spectral self-regularization strategy to suppress redundant high-frequency noise while simultaneously preserving reconstruction quality. The resulting Denoising-VAE, a ViT-based autoencoder that does not rely on VFMs, produces cleaner, lower-noise latents, leading to improved generative quality and faster optimization convergence. We further introduce a spectral alignment strategy to facilitate the optimization of Denoising-VAE-based generative models. Our complete method enables diffusion models to converge approximately 2× faster than with SD-VAE, while achieving state-of-the-art reconstruction quality (rFID = 0.28, PSNR = 27.26) and competitive generation performance (gFID = 1.82) on the ImageNet 256×256 benchmark.
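
To make the notion of "high-frequency components" in a latent concrete, the following is a minimal NumPy sketch that measures how much of a latent tensor's spectral energy lies above a radial frequency cutoff, and shows a plain low-pass filter that removes it. It is an illustration only and not the paper's regularizer: the function names `high_freq_energy_ratio` and `low_pass`, the `(C, H, W)` latent layout, and the `cutoff` parameter are all assumptions introduced here.

```python
import numpy as np


def _radial_frequency(h, w):
    """Normalized radial frequency grid; 1.0 is the Nyquist frequency along one axis."""
    fy = np.fft.fftfreq(h)[:, None]  # cycles per sample, in [-0.5, 0.5)
    fx = np.fft.fftfreq(w)[None, :]
    return np.sqrt(fy ** 2 + fx ** 2) / 0.5


def high_freq_energy_ratio(latent, cutoff=0.5):
    """Fraction of spectral energy above `cutoff` (hypothetical diagnostic).

    latent: array of shape (C, H, W); the layout is an assumption of this sketch.
    """
    _, h, w = latent.shape
    power = np.abs(np.fft.fft2(latent, axes=(-2, -1))) ** 2
    high = _radial_frequency(h, w) > cutoff  # boolean mask of shape (H, W)
    total = power.sum()
    if total == 0:
        return 0.0
    return float(power[:, high].sum() / total)


def low_pass(latent, cutoff=0.5):
    """Zero out all frequency bins above `cutoff` and transform back.

    A hard low-pass filter, used here only to illustrate suppression of
    high-frequency content; the paper's spectral self-regularization is a
    training strategy and is not reproduced here.
    """
    _, h, w = latent.shape
    spectrum = np.fft.fft2(latent, axes=(-2, -1))
    keep = _radial_frequency(h, w) <= cutoff
    return np.fft.ifft2(spectrum * keep, axes=(-2, -1)).real


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noisy = rng.standard_normal((4, 16, 16))  # stand-in for a noisy latent
    print("noisy latent:   ", high_freq_energy_ratio(noisy))
    print("after low-pass: ", high_freq_energy_ratio(low_pass(noisy)))
```

A constant latent has all of its energy in the zero-frequency bin, so its ratio is zero, while a pixel-level checkerboard concentrates its energy at the highest frequency and gives a ratio close to one. A diagnostic of this kind could be used to compare latents from different autoencoders, under the assumption that the chosen cutoff is meaningful for the latent resolution in question.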

