CV · Oct 31, 2024

Enhancing Motion in Text-to-Video Generation with Decomposed Encoding and Conditioning

arXiv:2410.24219v1 · 3 citations · h-index: 3 · NIPS
Originality: Highly original
AI Analysis

This work addresses the challenge of producing dynamic videos from text for AI-generated content applications. It is a strong, targeted improvement rather than a foundational breakthrough.

The paper tackles the problem of generating realistic motion in Text-to-Video (T2V) generation. It proposes the DEMO framework, which decomposes text encoding and conditioning into content and motion components, and reports stronger motion dynamics with high visual quality on benchmarks such as MSR-VTT and VBench.

Despite advancements in Text-to-Video (T2V) generation, producing videos with realistic motion remains challenging. Current models often yield static or minimally dynamic outputs, failing to capture complex motions described by text. This issue stems from internal biases in text encoding, which overlook motion, and from inadequate conditioning mechanisms in T2V generation models. To address this, we propose a novel framework called DEcomposed MOtion (DEMO), which enhances motion synthesis in T2V generation by decomposing both text encoding and conditioning into content and motion components. Our method includes a content encoder for static elements and a motion encoder for temporal dynamics, alongside separate content and motion conditioning mechanisms. Crucially, we introduce text-motion and video-motion supervision to improve the model's understanding and generation of motion. Evaluations on benchmarks such as MSR-VTT, UCF-101, WebVid-10M, EvalCrafter, and VBench demonstrate DEMO's superior ability to produce videos with enhanced motion dynamics while maintaining high visual quality. Our approach significantly advances T2V generation by integrating comprehensive motion understanding directly from textual descriptions. Project page: https://PR-Ryan.github.io/DEMO-project/
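
The split into content and motion paths described in the abstract can be illustrated with a toy sketch. This is not the DEMO architecture: the abstract does not specify the encoders or the conditioning layers, so everything below is an assumption made for illustration. The names `MOTION_WORDS`, `encode_content`, `encode_motion`, and `condition_video` are hypothetical, and a fixed verb list stands in for what would be learned encoders. The sketch only shows the structural idea that a content signal is applied identically to every frame while a motion signal varies along the time axis.

```python
from typing import List

Vector = List[float]

# Hypothetical stand-in for a learned motion encoder: a fixed verb list.
MOTION_WORDS = {"runs", "jumps", "spins", "walks", "waves"}


def encode_content(prompt: str) -> Vector:
    """Toy content encoder: counts non-motion tokens (static elements)."""
    tokens = prompt.lower().split()
    return [float(sum(1 for t in tokens if t not in MOTION_WORDS))]


def encode_motion(prompt: str) -> Vector:
    """Toy motion encoder: counts motion tokens (temporal dynamics)."""
    tokens = prompt.lower().split()
    return [float(sum(1 for t in tokens if t in MOTION_WORDS))]


def condition_video(frames: List[Vector], prompt: str) -> List[Vector]:
    """Apply content conditioning per frame and motion conditioning across time."""
    content = encode_content(prompt)[0]
    motion = encode_motion(prompt)[0]
    last = max(len(frames) - 1, 1)  # avoid division by zero for one frame
    out = []
    for t, frame in enumerate(frames):
        # Content path: the same shift for every frame.
        # Motion path: a shift that grows with the frame index t.
        out.append([x + content + motion * (t / last) for x in frame])
    return out


video = [[0.0, 0.0] for _ in range(3)]
static = condition_video(video, "a red car")
moving = condition_video(video, "a dog runs")
```

In this toy setup, a prompt without a motion word shifts every frame by the same amount, so the frames stay identical, while a prompt containing a motion word produces frames that differ over time.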

Code Implementations: 1 repo