CV · Jul 18, 2025

DiViD: Disentangled Video Diffusion for Static-Dynamic Factorization

arXiv:2507.13934v26.21 citations · h-index: 5 · 2025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)

Originality: Incremental advance

AI Analysis

This work gives researchers and practitioners in video analysis and generation a more effective method for disentangling static and dynamic components. It appears incremental, since it builds on existing diffusion and disentanglement techniques.

The paper tackles unsupervised disentanglement of static appearance and dynamic motion in video, where existing methods often suffer from information leakage and blurry reconstructions. It introduces DiViD, a video diffusion framework that outperforms state-of-the-art sequential disentanglement methods: it achieves the highest swap-based joint accuracy, preserves static fidelity while improving dynamic transfer, and reduces average cross-leakage.

Unsupervised disentanglement of static appearance and dynamic motion in video remains a fundamental challenge, often hindered by information leakage and blurry reconstructions in existing VAE- and GAN-based approaches. We introduce DiViD, the first end-to-end video diffusion framework for explicit static-dynamic factorization. DiViD's sequence encoder extracts a global static token from the first frame and per-frame dynamic tokens, explicitly removing static content from the motion code. Its conditional DDPM decoder incorporates three key inductive biases: a shared-noise schedule for temporal consistency, a time-varying KL-based bottleneck that tightens at early timesteps (compressing static information) and relaxes later (enriching dynamics), and cross-attention that routes the global static token to all frames while keeping dynamic tokens frame-specific. An orthogonality regularizer further prevents residual static-dynamic leakage. We evaluate DiViD on real-world benchmarks using swap-based accuracy and cross-leakage metrics. DiViD outperforms state-of-the-art sequential disentanglement methods: it achieves the highest swap-based joint accuracy, preserves static fidelity while improving dynamic transfer, and reduces average cross-leakage.
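Two of the ingredients named in the abstract, the time-varying KL bottleneck and the orthogonality regularizer, can be illustrated with a minimal sketch. The abstract does not give their exact form, so everything below is an assumption made for illustration: the linear schedule, the parameter names `beta_early` and `beta_late`, the convention that timestep index 0 counts as "early", and the squared-cosine penalty are all hypothetical and are not taken from the paper.

```python
import math


def kl_weight(t, num_steps, beta_early=1.0, beta_late=0.1):
    """Hypothetical linear KL-weight schedule (not the paper's formula).

    Returns a large weight at early timesteps (tight bottleneck) and a
    smaller weight at late timesteps (relaxed bottleneck). Timestep 0 is
    treated as "early" here; that indexing convention is an assumption.
    """
    if num_steps < 2:
        return beta_early
    frac = t / (num_steps - 1)  # 0.0 at the first step, 1.0 at the last
    return beta_early + (beta_late - beta_early) * frac


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def orthogonality_penalty(static_token, dynamic_tokens):
    """Hypothetical regularizer: mean squared cosine similarity between the
    global static token and each per-frame dynamic token. It is zero when
    every dynamic token is orthogonal to the static token."""
    if not dynamic_tokens:
        return 0.0
    total = sum(cosine(static_token, d) ** 2 for d in dynamic_tokens)
    return total / len(dynamic_tokens)


if __name__ == "__main__":
    static = [1.0, 0.0]
    dynamics = [[0.0, 1.0], [1.0, 1.0]]
    print([round(kl_weight(t, 5), 3) for t in range(5)])
    print(orthogonality_penalty(static, dynamics))
```

In a training loop, a penalty of this kind would typically be added to the diffusion loss with its own coefficient, and the KL term would be scaled by the schedule value for the sampled timestep. How DiViD actually combines and weights these terms is not stated in the abstract.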
