CVAIIVJan 9, 2025

Progressive Growing of Video Tokenizers for Temporally Compact Latent Spaces

arXiv:2501.05442v22 citationsh-index: 6
Originality Incremental advance
AI Analysis

This work addresses a bottleneck in video diffusion models for efficient video generation, representing an incremental improvement over existing methods.

The paper tackles the challenge of achieving high temporal compression in video tokenizers without increasing channel capacity by proposing a bootstrapped progressive training method, which significantly improves reconstruction quality and enables high-quality video generation with a reduced token budget.

Video tokenizers are essential for latent video diffusion models, converting raw video data into spatiotemporally compressed latent spaces for efficient training. However, extending state-of-the-art video tokenizers to achieve a temporal compression ratio beyond 4x without increasing channel capacity poses significant challenges. In this work, we propose an alternative approach to enhance temporal compression. We find that the reconstruction quality of temporally subsampled videos from a low-compression encoder surpasses that of high-compression encoders applied to original videos. This indicates that high-compression models can leverage representations from lower-compression models. Building on this insight, we develop a bootstrapped high-temporal-compression model that progressively trains high-compression blocks atop well-trained lower-compression models. Our method includes a cross-level feature-mixing module to retain information from the pretrained low-compression model and guide higher-compression blocks to capture the remaining details from the full video sequence. Evaluation of video benchmarks shows that our method significantly improves reconstruction quality while increasing temporal compression compared to directly training the full model. Furthermore, the resulting compact latent space effectively trains a video diffusion model for high-quality video generation with a significantly reduced token budget.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes