CV · Dec 5, 2025

Delving into Latent Spectral Biasing of Video VAEs for Superior Diffusability

arXiv:2512.05394v1 · 6 citations · Has Code
Originality: Incremental advance
AI Analysis

This work addresses a bottleneck in latent video diffusion: video VAE latent spaces that are poorly suited to diffusion training. It offers an incremental improvement over existing methods.

The paper analyzes the spectral properties of video VAE latents and proposes two regularizers that reshape them. The authors report a 3x speedup in text-to-video generation convergence and a 10% gain in video reward.

Latent diffusion models pair VAEs with diffusion backbones, and the structure of VAE latents strongly influences the difficulty of diffusion training. However, existing video VAEs typically focus on reconstruction fidelity, overlooking latent structure. We present a statistical analysis of video VAE latent spaces and identify two spectral properties essential for diffusion training: a spatio-temporal frequency spectrum biased toward low frequencies, and a channel-wise eigenspectrum dominated by a few modes. To induce these properties, we propose two lightweight, backbone-agnostic regularizers: Local Correlation Regularization and Latent Masked Reconstruction. Experiments show that our Spectral-Structured VAE (SSVAE) achieves a 3x speedup in text-to-video generation convergence and a 10% gain in video reward, outperforming strong open-source VAEs. The code is available at https://github.com/zai-org/SSVAE.
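
As a rough illustration of the two spectral properties named in the abstract, the NumPy sketch below measures them on a single latent tensor. It is not taken from the SSVAE repository and may differ from the statistics the authors actually compute: the function names, the (C, T, H, W) layout, and the 0.25 frequency cutoff are all assumptions made for this example.

```python
import numpy as np


def low_frequency_energy_fraction(latent, cutoff=0.25):
    """Share of spatio-temporal spectral energy at or below `cutoff`.

    latent: array of shape (C, T, H, W) (an assumed layout).
    cutoff: radial frequency threshold in cycles per sample (hypothetical default).
    """
    _, t, h, w = latent.shape
    # Remove each channel's mean so the DC term does not dominate the result.
    centered = latent - latent.mean(axis=(1, 2, 3), keepdims=True)
    # Power spectrum over time, height, and width, summed across channels.
    power = (np.abs(np.fft.fftn(centered, axes=(1, 2, 3))) ** 2).sum(axis=0)
    # Radial frequency of every FFT bin.
    ft = np.fft.fftfreq(t)[:, None, None]
    fh = np.fft.fftfreq(h)[None, :, None]
    fw = np.fft.fftfreq(w)[None, None, :]
    radius = np.sqrt(ft ** 2 + fh ** 2 + fw ** 2)
    total = power.sum()
    if total == 0:
        return 0.0
    return float(power[radius <= cutoff].sum() / total)


def channel_eigenspectrum(latent):
    """Normalized eigenvalues (descending) of the channel covariance matrix."""
    c = latent.shape[0]
    flat = latent.reshape(c, -1)          # one row per channel
    cov = np.cov(flat)                    # (C, C) covariance across positions
    eigvals = np.linalg.eigvalsh(cov)[::-1]
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative values
    return eigvals / eigvals.sum()
```

Under these assumptions, a latent with the properties described above would give a low-frequency fraction close to 1 and an eigenspectrum whose first few entries carry most of the mass, whereas white noise spreads energy roughly evenly across frequencies and channels.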

