CV

Dec 10, 2024

Motion by Queries: Identity-Motion Trade-offs in Text-to-Video Generation

Yuval Atzmon, Rinon Gal, Yoad Tewel, Yoni Kasten, Gal Chechik

arXiv:2412.07750v37.63 citations

h-index: 20

Originality: Incremental advance

AI Analysis

The paper addresses the challenge of controlling motion without transferring identity in text-to-video generation. The advance is rated incremental because it builds on existing models such as VideoCrafter2 and WAN 2.1.

The paper studies how motion, structure, and identity interact in text-to-video diffusion models. It finds that self-attention query features affect both motion and identity. Building on this finding, it presents a zero-shot motion transfer method that is 10 times more efficient than existing approaches, and a training-free technique for consistent multi-shot video generation.

Text-to-video diffusion models have shown remarkable progress in generating coherent video clips from textual descriptions. However, the interplay between motion, structure, and identity representations in these models remains under-explored. Here, we investigate how self-attention query (Q) features simultaneously govern motion, structure, and identity, and examine the challenges that arise when these representations interact. Our analysis reveals that Q affects not only layout: during denoising, Q also has a strong effect on subject identity, making it hard to transfer motion without the side effect of transferring identity. Understanding this dual role enabled us to control query feature injection (Q injection) and demonstrate two applications: (1) a zero-shot motion transfer method (implemented with VideoCrafter2 and WAN 2.1) that is 10 times more efficient than existing approaches, and (2) a training-free technique for consistent multi-shot video generation, where characters maintain identity across multiple video shots while Q injection enhances motion fidelity.
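As a rough illustration of what Q injection means mechanically, the sketch below runs single-head self-attention on a target sequence while swapping in query features computed from a reference sequence. It is not the authors' implementation: the function names, the `alpha` blending weight, and the toy random features are all assumptions made for illustration, and a real model would apply this inside specific layers and denoising steps of a video diffusion network.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v, q_inject=None, alpha=1.0):
    """Single-head self-attention with optional query injection.

    q_inject and alpha are hypothetical parameters for this sketch:
    when q_inject is given, the layer's own queries are blended with
    it (alpha=1.0 replaces them fully, alpha=0.0 leaves them untouched).
    """
    q = x @ w_q
    if q_inject is not None:
        q = alpha * q_inject + (1.0 - alpha) * q
    k = x @ w_k  # keys and values still come from the target features
    v = x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

# Toy stand-ins for per-token features of a reference and a target video.
rng = np.random.default_rng(0)
tokens, dim = 6, 8
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) for _ in range(3))
x_ref = rng.standard_normal((tokens, dim))
x_tgt = rng.standard_normal((tokens, dim))

q_ref = x_ref @ w_q  # queries recorded from the reference pass
out_plain = self_attention(x_tgt, w_q, w_k, w_v)
out_injected = self_attention(x_tgt, w_q, w_k, w_v, q_inject=q_ref)
out_noop = self_attention(x_tgt, w_q, w_k, w_v, q_inject=q_ref, alpha=0.0)
```

Because the injected queries decide where each token attends, they carry the reference layout into the target pass. The abstract's observation is that the same queries also carry subject identity, which is why the amount and timing of injection need to be controlled.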

