CV, LG, MM | Nov 18, 2025

DeCo-VAE: Learning Compact Latents for Video Reconstruction via Decoupled Representation

arXiv:2511.14530v1 | 3 citations
Originality: Incremental advance
AI Analysis

This work addresses the issue of inefficient video representation for machine learning applications, though it appears incremental as it builds on existing VAE frameworks with a novel decomposition approach.

The paper tackles the problem of redundant latent modeling in video Variational Autoencoders by proposing DeCo-VAE, which decomposes video content into keyframe, motion, and residual components with dedicated encoders, achieving superior video reconstruction performance as demonstrated in experiments.

Existing video Variational Autoencoders (VAEs) generally overlook the similarity between frame contents, leading to redundant latent modeling. In this paper, we propose decoupled VAE (DeCo-VAE) to achieve compact latent representation. Instead of encoding RGB pixels directly, we decompose video content into distinct components via explicit decoupling: keyframe, motion and residual, and learn dedicated latent representation for each. To avoid cross-component interference, we design dedicated encoders for each decoupled component and adopt a shared 3D decoder to maintain spatiotemporal consistency during reconstruction. We further utilize a decoupled adaptation strategy that freezes partial encoders while training the others sequentially, ensuring stable training and accurate learning of both static and dynamic features. Extensive quantitative and qualitative experiments demonstrate that DeCo-VAE achieves superior video reconstruction performance.
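To make the keyframe/motion/residual split concrete, here is a minimal NumPy sketch of one possible lossless decomposition. The abstract does not say how the components are computed, so everything here is an assumption for illustration: the keyframe is taken to be the first frame, "motion" is a block-averaged difference from the keyframe (a crude stand-in for the paper's actual motion representation), and the residual is whatever remains. The names `decompose`, `recompose`, and `block` are hypothetical. No encoders or decoder are modeled.

```python
import numpy as np


def decompose(video: np.ndarray, block: int = 4):
    """Split a (T, H, W) clip into keyframe, coarse motion, and residual.

    Assumption: 'motion' is a block-averaged difference from the first
    frame. This is an illustrative proxy, not the paper's definition.
    """
    t, h, w = video.shape
    if h % block or w % block:
        raise ValueError("H and W must be divisible by block")
    keyframe = video[0]                      # static content, shape (H, W)
    diff = video - keyframe                  # per-frame change, shape (T, H, W)
    # Coarse motion: average the change over non-overlapping block x block tiles.
    motion = diff.reshape(t, h // block, block, w // block, block).mean(axis=(2, 4))
    # Residual: fine detail the coarse motion map does not explain.
    residual = diff - _upsample(motion, block)
    return keyframe, motion, residual


def _upsample(motion: np.ndarray, block: int) -> np.ndarray:
    """Nearest-neighbour upsampling of a (T, H/b, W/b) map back to (T, H, W)."""
    return motion.repeat(block, axis=1).repeat(block, axis=2)


def recompose(keyframe: np.ndarray, motion: np.ndarray,
              residual: np.ndarray, block: int = 4) -> np.ndarray:
    """Invert decompose(): keyframe + upsampled motion + residual."""
    return keyframe + _upsample(motion, block) + residual
```

In this sketch the three parts sum back to the original clip, and the keyframe is stored once rather than once per frame. In DeCo-VAE each component would instead pass through its own encoder and the shared 3D decoder would perform the reconstruction, which this toy example does not attempt to reproduce.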

