CV AI LG

Dec 10, 2024

Video Motion Transfer with Diffusion Transformers

Alexander Pondaven, Aliaksandr Siarohin, Sergey Tulyakov, Philip Torr, Fabio Pizzati

arXiv:2412.07776v2 | 23.333 citations | h-index: 29 | Has Code | CVPR

Originality: Incremental advance

AI Analysis

This work addresses video motion transfer for content creation. It offers a training-free method that improves zero-shot motion transfer, though the advance is incremental because it builds on existing Diffusion Transformer frameworks.

The authors propose DiTFlow, which extracts motion signals from cross-frame attention in Diffusion Transformers and uses optimization-based guidance to steer video generation. They report that it outperforms existing methods across multiple metrics and in human evaluation.

We propose DiTFlow, a method for transferring the motion of a reference video to a newly synthesized one, designed specifically for Diffusion Transformers (DiT). We first process the reference video with a pre-trained DiT to analyze cross-frame attention maps and extract a patch-wise motion signal called the Attention Motion Flow (AMF). We guide the latent denoising process in an optimization-based, training-free manner by optimizing latents with our AMF loss to generate videos reproducing the motion of the reference one. We also apply our optimization strategy to transformer positional embeddings, granting us a boost in zero-shot motion transfer capabilities. We evaluate DiTFlow against recently published methods, outperforming all across multiple metrics and human evaluation.
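The idea of reading a patch-wise motion signal out of cross-frame attention can be illustrated with a small NumPy sketch. This is one plausible reading of the abstract, not the authors' implementation: the function names (`patch_grid`, `cross_frame_attention`, `displacement`, `motion_loss`), the hard argmax for the reference video, the attention-weighted average for the generated video, and the mean-squared-error loss are all assumptions made for illustration. In a real pipeline the features would come from a pre-trained DiT and the loss would be differentiated with respect to the latents.

```python
import numpy as np


def patch_grid(h, w):
    """Return an (h*w, 2) array of (row, col) coordinates, one per patch."""
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return np.stack([rows.ravel(), cols.ravel()], axis=-1).astype(np.float64)


def cross_frame_attention(q_i, k_j):
    """Softmax attention from patches of frame i (queries) to frame j (keys).

    q_i, k_j: (num_patches, dim) arrays. Hypothetical stand-ins for DiT
    query/key features; the result has shape (num_patches, num_patches).
    """
    dim = q_i.shape[-1]
    logits = q_i @ k_j.T / np.sqrt(dim)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)


def displacement(attn, coords, hard):
    """Per-patch displacement implied by an attention map.

    hard=True  : jump to the most-attended patch (assumed for the reference).
    hard=False : attention-weighted mean position, which stays differentiable
                 (assumed for the video being generated).
    """
    if hard:
        target = coords[attn.argmax(axis=-1)]
    else:
        target = attn @ coords
    return target - coords


def motion_loss(ref_disp, gen_disp):
    """Mean squared error between reference and generated displacements."""
    return float(np.mean((ref_disp - gen_disp) ** 2))
```

In an optimization-based, training-free setup, a loss of this kind would be minimized with respect to the noisy latents (or the positional embeddings) at selected denoising steps, leaving the model weights untouched.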
