CV | IV | Mar 18, 2025

LeanVAE: An Ultra-Efficient Reconstruction VAE for Video Diffusion Models

arXiv:2503.14325v1 | 11 citations | h-index: 5 | Has Code
Originality: Incremental advance
AI Analysis

LeanVAE addresses efficiency issues for researchers and practitioners scaling video generation models, though the advance is incremental because it builds on existing Video VAE methods.

The paper tackles the computational bottleneck of Video VAEs in Latent Video Diffusion Models by proposing LeanVAE, an ultra-efficient framework that reduces FLOPs by up to 50x and speeds up inference by 44x while maintaining competitive reconstruction quality.

Recent advances in Latent Video Diffusion Models (LVDMs) have revolutionized video generation by leveraging Video Variational Autoencoders (Video VAEs) to compress intricate video data into a compact latent space. However, as LVDM training scales, the computational overhead of Video VAEs becomes a critical bottleneck, particularly for encoding high-resolution videos. To address this, we propose LeanVAE, a novel and ultra-efficient Video VAE framework that introduces two key innovations: (1) a lightweight architecture based on a Neighborhood-Aware Feedforward (NAF) module and non-overlapping patch operations, drastically reducing computational cost, and (2) the integration of wavelet transforms and compressed sensing techniques to enhance reconstruction quality. Extensive experiments validate LeanVAE's superiority in video reconstruction and generation, particularly in enhancing efficiency over existing Video VAEs. Our model offers up to 50x fewer FLOPs and 44x faster inference speed while maintaining competitive reconstruction quality, providing insights for scalable, efficient video generation. Our models and code are available at https://github.com/westlake-repl/LeanVAE
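The abstract names two generic building blocks, non-overlapping patch operations and wavelet transforms, without giving details. The following is a minimal sketch of both in plain Python, not LeanVAE's implementation. The function names, the 2x2x2 patch size, and the choice of a one-level Haar wavelet are all assumptions made for illustration; the abstract does not state which patch sizes or wavelet family the model uses.

```python
import math

def patchify(video, pt, ph, pw):
    """Split a video (nested lists, shape T x H x W) into non-overlapping
    patches of size pt x ph x pw, each flattened to pt*ph*pw values.
    Hypothetical helper for illustration only."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    if T % pt or H % ph or W % pw:
        raise ValueError("video dimensions must be divisible by the patch size")
    patches = []
    # Step by the full patch size so that no two patches share a pixel.
    for t0 in range(0, T, pt):
        for h0 in range(0, H, ph):
            for w0 in range(0, W, pw):
                patches.append([
                    video[t][h][w]
                    for t in range(t0, t0 + pt)
                    for h in range(h0, h0 + ph)
                    for w in range(w0, w0 + pw)
                ])
    return patches

def haar_1d(signal):
    """One level of the orthonormal Haar transform on an even-length list.
    Returns (low-pass averages, high-pass differences)."""
    s = math.sqrt(2.0)
    low = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return low, high

def inverse_haar_1d(low, high):
    """Invert haar_1d, recovering the original samples."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(low, high):
        out.extend([(a + d) / s, (a - d) / s])
    return out

# A toy 4 x 4 x 4 "video" whose pixel value encodes its (t, h, w) position.
video = [[[float(t * 16 + h * 4 + w) for w in range(4)]
          for h in range(4)] for t in range(4)]
patches = patchify(video, 2, 2, 2)  # 8 patches of 8 values each

signal = [4.0, 2.0, 5.0, 5.0]
low, high = haar_1d(signal)
restored = inverse_haar_1d(low, high)
```

Because the patches do not overlap, each pixel is processed exactly once, which is one plausible reason such operations are cheaper than overlapping convolutions. The Haar step is invertible, so splitting a signal into low- and high-frequency bands loses no information by itself.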

Code Implementations: 1 repo
