CV · Sep 30, 2024

Replace Anyone in Videos

Xiang Wang, Shiwei Zhang, Haonan Qiu, Ruihang Chu, Zekun Li, Yingya Zhang, Changxin Gao, Yuehuan Wang, Chunhua Shen, Nong Sang

arXiv:2409.19911v2 · 4 citations · h-index: 20 · Has Code

AI Analysis

This work provides a method for more precise and localized control over human motion in videos, which is beneficial for content creators and video editing applications. It is an incremental improvement in controllable human-centric video generation.

This paper introduces ReplaceAnyone, a framework for localized human replacement and insertion in videos. It addresses the challenge of controlling human motion precisely while preserving intricate backgrounds. The task is formulated as image-conditioned video inpainting with pose guidance, built on a unified end-to-end video diffusion architecture. The method replaces or inserts characters while maintaining the desired pose motion and reference appearance, producing realistic and coherent video content.

The field of controllable human-centric video generation has witnessed remarkable progress, particularly with the advent of diffusion models. However, achieving precise and localized control over human motion in videos, such as replacing or inserting individuals while preserving desired motion patterns, remains a formidable challenge. In this work, we present the ReplaceAnyone framework, which focuses on localized human replacement and insertion in videos with intricate backgrounds. Specifically, we formulate this task as image-conditioned video inpainting with pose guidance, using a unified end-to-end video diffusion architecture that inpaints the masked regions conditioned on a reference image. To prevent shape leakage and enable granular local control, we introduce diverse mask forms involving both regular and irregular shapes. Furthermore, we implement an enriched visual guidance mechanism to enhance appearance alignment, a hybrid inpainting encoder to further preserve the detailed background information in the masked video, and a two-phase optimization methodology to reduce training difficulty. ReplaceAnyone enables seamless replacement or insertion of characters while maintaining the desired pose motion and reference appearance within a single framework. Extensive experimental results demonstrate the effectiveness of our method in generating realistic and coherent video content. The proposed ReplaceAnyone can be applied not only to traditional 3D-UNet base models but also to DiT-based video models such as Wan2.1. The code will be available at https://github.com/ali-vilab/UniAnimate-DiT.
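The idea of masking with both regular and irregular shapes can be illustrated with a small sketch. The abstract does not say how the masks are constructed, so everything here is an assumption made for illustration: the function names (`box_mask`, `irregular_mask`, `apply_mask`), the use of a padded bounding box as the "regular" shape, and the union of random ellipses as the "irregular" shape are all hypothetical and not taken from the ReplaceAnyone code. The point is only that a mask which does not trace the person's silhouette gives the model no outline to copy, which is one plausible reading of "preventing shape leakage".

```python
import random


def box_mask(height, width, box, pad=0):
    """Regular mask: 1 inside the (optionally padded) bounding box, 0 elsewhere.

    `box` is (top, left, bottom, right) with exclusive bottom/right edges.
    """
    top, left, bottom, right = box
    top, left = max(0, top - pad), max(0, left - pad)
    bottom, right = min(height, bottom + pad), min(width, right + pad)
    return [
        [1 if top <= y < bottom and left <= x < right else 0 for x in range(width)]
        for y in range(height)
    ]


def irregular_mask(height, width, box, rng, n_blobs=6):
    """Irregular mask: union of random ellipses whose centers lie in the box.

    This is an illustrative stand-in for an irregular shape; the paper's
    actual mask sampling procedure is not described in the abstract.
    """
    top, left, bottom, right = box
    mask = [[0] * width for _ in range(height)]
    for _ in range(n_blobs):
        cy, cx = rng.uniform(top, bottom), rng.uniform(left, right)
        ry = rng.uniform(1, max(1, (bottom - top) / 2))
        rx = rng.uniform(1, max(1, (right - left) / 2))
        for y in range(height):
            for x in range(width):
                if ((y - cy) / ry) ** 2 + ((x - cx) / rx) ** 2 <= 1:
                    mask[y][x] = 1
    return mask


def apply_mask(frame, mask):
    """Zero out masked pixels so that only the background stays visible."""
    return [
        [0 if m else p for p, m in zip(frame_row, mask_row)]
        for frame_row, mask_row in zip(frame, mask)
    ]


# Toy usage: one 8x8 grayscale frame with a "person" box in the middle.
frame = [[5] * 8 for _ in range(8)]
regular = box_mask(8, 8, (2, 2, 6, 6), pad=1)
irregular = irregular_mask(8, 8, (2, 2, 6, 6), random.Random(0))
masked_frame = apply_mask(frame, regular)
```

In an inpainting setup of this kind, the masked frames (background only) would be passed to the model together with the mask, the pose sequence, and the reference image, and the model would fill in the masked region. A training pipeline might sample a mask type at random per clip, but that scheduling is also an assumption here.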
