CV | AI | LG | Dec 10, 2024

Video Motion Transfer with Diffusion Transformers

arXiv:2412.07776v2 | 33 citations | h-index: 29 | CVPR
Originality: Incremental advance
AI Analysis

This work addresses video motion transfer for content creation. It offers a training-free method that improves zero-shot motion transfer, though the advance is incremental because it builds on existing Diffusion Transformer frameworks.

The authors propose DiTFlow, which extracts motion signals from cross-frame attention in Diffusion Transformers and uses an optimization-based approach to guide video generation. They report that it outperforms existing methods on quantitative metrics and in human evaluation.

We propose DiTFlow, a method for transferring the motion of a reference video to a newly synthesized one, designed specifically for Diffusion Transformers (DiT). We first process the reference video with a pre-trained DiT to analyze cross-frame attention maps and extract a patch-wise motion signal called the Attention Motion Flow (AMF). We then guide the latent denoising process in an optimization-based, training-free manner, optimizing latents with our AMF loss to generate videos that reproduce the motion of the reference. We also apply our optimization strategy to transformer positional embeddings, which boosts zero-shot motion transfer capabilities. We evaluate DiTFlow against recently published methods, outperforming all of them across multiple metrics and in human evaluation.
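
To make the idea of a patch-wise motion signal concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes that the flow for each patch is the attention-weighted average position in a second frame minus the patch's own position, and that the loss is a mean squared error between reference and generated flows. The function names (`patch_grid`, `cross_frame_attention`, `attention_motion_flow`, `amf_loss`) and the toy one-hot features are hypothetical.

```python
import numpy as np

def patch_grid(h, w):
    """Return an (h*w, 2) array of (row, col) coordinates, one per patch."""
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return np.stack([rows.ravel(), cols.ravel()], axis=-1).astype(np.float64)

def cross_frame_attention(q_src, k_tgt):
    """Softmax attention from source-frame patches to target-frame patches."""
    d = q_src.shape[-1]
    logits = q_src @ k_tgt.T / np.sqrt(d)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)

def attention_motion_flow(q_src, k_tgt, h, w):
    """Assumed flow: expected target position under attention, minus own position."""
    coords = patch_grid(h, w)
    attn = cross_frame_attention(q_src, k_tgt)
    return attn @ coords - coords

def amf_loss(amf_ref, amf_gen):
    """Assumed loss: mean squared error between two flow fields."""
    return float(np.mean((amf_ref - amf_gen) ** 2))

# Toy example: each patch has a distinct one-hot feature, and the second
# frame is the first frame shifted one patch to the right (with wrap-around).
h, w = 3, 4
feats = 10.0 * np.eye(h * w)
frame_a = feats.reshape(h, w, -1)
frame_b = np.roll(frame_a, shift=1, axis=1)

flow = attention_motion_flow(frame_a.reshape(h * w, -1),
                             frame_b.reshape(h * w, -1), h, w)
interior = patch_grid(h, w)[:, 1] < w - 1  # skip the wrapped last column
```

In a real pipeline the queries and keys would come from the attention layers of a pre-trained DiT, and the loss would be backpropagated to the latents with an autograd framework at selected denoising steps. Both are outside the scope of this sketch.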

Code Implementations: 1 repo