CV · AI · LG · Dec 23, 2024

VidTwin: Video VAE with Decoupled Structure and Dynamics

Peking U
arXiv:2412.17726v2 · 11 citations · h-index: 12 · CVPR
Originality: Highly original
AI Analysis

This work addresses video compression and generation for AI and multimedia applications. It is an incremental improvement that applies a novel method to a known bottleneck.

The paper tackles video generation by proposing VidTwin, a video autoencoder that decouples structure and dynamics into separate latent spaces. It reports a compression rate of 0.20% and a PSNR of 28.14 on the MCL-JCV dataset.

Recent advancements in video autoencoders (Video AEs) have significantly improved the quality and efficiency of video generation. In this paper, we propose a novel and compact video autoencoder, VidTwin, that decouples video into two distinct latent spaces: Structure latent vectors, which capture overall content and global movement, and Dynamics latent vectors, which represent fine-grained details and rapid movements. Specifically, our approach leverages an Encoder-Decoder backbone, augmented with two submodules for extracting these latent spaces, respectively. The first submodule employs a Q-Former to extract low-frequency motion trends, followed by downsampling blocks to remove redundant content details. The second averages the latent vectors along the spatial dimension to capture rapid motion. Extensive experiments show that VidTwin achieves a high compression rate of 0.20% with high reconstruction quality (PSNR of 28.14 on the MCL-JCV dataset), and performs efficiently and effectively in downstream generative tasks. Moreover, our model demonstrates explainability and scalability, paving the way for future research in video latent representation and generation. Check our project page for more details: https://vidtwin.github.io/.
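The two headline numbers in the abstract (compression rate and PSNR) can be made concrete with a small sketch. This is not the authors' evaluation code. It assumes the compression rate is the number of latent elements divided by the number of raw video elements, and it uses the standard PSNR definition over pixel values. The function names and the example shapes are hypothetical.

```python
import math


def compression_rate(latent_elems, video_shape):
    """Ratio of latent elements to raw video elements.

    Assumed definition for illustration; the paper may count
    dimensions differently.
    """
    total = math.prod(video_shape)
    if total == 0:
        raise ValueError("video_shape must describe a non-empty video")
    return latent_elems / total


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two flat pixel sequences."""
    if len(reference) != len(reconstruction) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    # Hypothetical shapes: 16 frames of 224x224 RGB video, 4096 latent values.
    rate = compression_rate(4096, (16, 224, 224, 3))
    print(f"compression rate: {rate:.4%}")
    print(f"PSNR: {psnr([100, 100], [110, 90]):.2f} dB")
```

A lower compression rate means a more compact latent, and a higher PSNR means the reconstruction is closer to the original, so the two metrics trade off against each other.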

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
