CV · Jun 25, 2024

MotionBooth: Motion-Aware Customized Text-to-Video Generation

arXiv:2406.17758v3 · 101 citations
Originality: Incremental advance
AI Analysis

This work addresses the problem of animating a specific object with controlled movement in generated video. It is an incremental advance in text-to-video customization.

MotionBooth generates videos of a customized subject with control over both object and camera motion. It fine-tunes a text-to-video model on a few images of the subject, and the authors report that their evaluations show better preservation of subject appearance and better motion control.

In this work, we present MotionBooth, an innovative framework designed for animating customized subjects with precise control over both object and camera movements. By leveraging a few images of a specific object, we efficiently fine-tune a text-to-video model to capture the object's shape and attributes accurately. Our approach introduces a subject region loss and a video preservation loss to enhance the subject's learning performance, along with a subject token cross-attention loss to integrate the customized subject with motion control signals. Additionally, we propose training-free techniques for managing subject and camera motions during inference. In particular, we utilize cross-attention map manipulation to govern subject motion and introduce a novel latent shift module for camera movement control. MotionBooth excels in preserving the appearance of subjects while simultaneously controlling the motions in generated videos. Extensive quantitative and qualitative evaluations demonstrate the superiority and effectiveness of our method. Our project page is at https://jianzongwu.github.io/projects/motionbooth
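
To make two of the ideas named in the abstract more concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes that a "subject region loss" can be read as a reconstruction error restricted to a binary subject mask, and that a "latent shift" can be read as translating a latent feature map along its spatial axes. The function names, argument names, array shapes, and the constant fill value are all hypothetical choices made for illustration.

```python
import numpy as np


def subject_region_loss(noise_pred, noise_target, subject_mask):
    """Mean squared error counted only inside the subject mask.

    Assumption: the mask is binary and broadcastable to the prediction shape.
    """
    mask = np.broadcast_to(subject_mask, noise_pred.shape).astype(noise_pred.dtype)
    sq_err = (noise_pred - noise_target) ** 2
    denom = mask.sum()
    if denom == 0:
        # No subject pixels: nothing to penalise.
        return 0.0
    return float((sq_err * mask).sum() / denom)


def shift_latent(latent, dx, dy, fill=0.0):
    """Translate a (C, H, W) latent by dx columns and dy rows.

    Cells vacated by the shift are set to `fill`. This constant fill is a
    placeholder assumption, not a claim about how the paper fills them.
    """
    _, h, w = latent.shape
    out = np.full_like(latent, fill)
    if abs(dx) >= w or abs(dy) >= h:
        # The shift moves everything out of frame.
        return out
    src_x = slice(max(0, -dx), min(w, w - dx))
    dst_x = slice(max(0, dx), min(w, w + dx))
    src_y = slice(max(0, -dy), min(h, h - dy))
    dst_y = slice(max(0, dy), min(h, h + dy))
    out[:, dst_y, dst_x] = latent[:, src_y, src_x]
    return out
```

In this sketch a positive `dx` moves content to the right and a positive `dy` moves it down. Applying a growing shift frame by frame would mimic a steady camera pan, under the same assumptions.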

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
