CV · Sep 10, 2025

Bitrate-Controlled Diffusion for Disentangling Motion and Content in Video

arXiv:2509.08376v1 · 2 citations · h-index: 9
Originality: Incremental advance
AI Analysis

This work addresses video analysis and generation by providing a method to separate the dynamic and static components of a video. The advance is incremental: it builds on existing disentanglement concepts and adds a novel bitrate-control approach.

The authors tackle the problem of disentangling motion and content in video data with a self-supervised framework that combines a transformer-based architecture with low-bitrate vector quantization. They validate it on real-world talking head videos and on pixel sprites of 2D cartoon characters.

We propose a novel and general framework to disentangle video data into its dynamic motion and static content components. Our proposed method is a self-supervised pipeline with fewer assumptions and inductive biases than previous works: it utilizes a transformer-based architecture to jointly generate flexible implicit features for frame-wise motion and clip-wise content, and incorporates a low-bitrate vector quantization as an information bottleneck to promote disentanglement and form a meaningful discrete motion space. The bitrate-controlled latent motion and content are used as conditional inputs to a denoising diffusion model to facilitate self-supervised representation learning. We validate our disentangled representation learning framework on real-world talking head videos with motion transfer and auto-regressive motion generation tasks. Furthermore, we show that our method can generalize to other types of video data, such as pixel sprites of 2D cartoon characters. Our work presents a new perspective on self-supervised learning of disentangled video representations, contributing to the broader field of video analysis and generation.
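
To make the "low-bitrate vector quantization as an information bottleneck" idea concrete, here is a minimal NumPy sketch of nearest-neighbour vector quantization applied to frame-wise motion features, together with the bitrate it implies. It is not the authors' implementation: the function names `quantize` and `bits_per_frame`, the codebook size (16), the feature dimension (8), the clip length (25 frames), and the one-token-per-frame setting are all illustrative assumptions that do not come from the paper.

```python
import math
import numpy as np


def quantize(features, codebook):
    """Snap each feature vector to its nearest codebook entry (squared L2).

    features: (T, D) array, one hypothetical motion feature per frame.
    codebook: (K, D) array of code vectors (learned in practice, random here).
    Returns the (T,) code indices and the (T, D) quantized features.
    """
    diffs = features[:, None, :] - codebook[None, :, :]   # (T, K, D)
    dists = (diffs ** 2).sum(axis=-1)                     # (T, K)
    indices = dists.argmin(axis=1)                        # (T,)
    return indices, codebook[indices]


def bits_per_frame(codebook_size, tokens_per_frame=1):
    """Upper bound on information carried per frame by the discrete codes."""
    return tokens_per_frame * math.log2(codebook_size)


# Illustrative sizes only; none of these values are taken from the paper.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 8))   # K = 16 codes of dimension D = 8
motion = rng.normal(size=(25, 8))     # T = 25 frames of motion features

indices, quantized = quantize(motion, codebook)
print(indices.shape, quantized.shape)
print(bits_per_frame(codebook_size=16))
```

With a 16-entry codebook and one token per frame, the motion branch can carry at most log2(16) = 4 bits per frame. The intuition behind the bottleneck is that so small a budget leaves room for frame-to-frame changes but not for appearance, which therefore has to flow through the clip-wise content features.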

