AnyCharV: Bootstrap Controllable Character Video Generation with Fine-to-Coarse Guidance
This work addresses the need for more flexible and controllable character video generation in animation and media production, and is an incremental improvement over existing methods.
The paper tackles the problem of generating high-quality character videos by flexibly integrating arbitrary source characters into target scenes using pose guidance, achieving superior results compared to previous state-of-the-art methods.
Character video generation is a significant real-world application focused on producing high-quality videos featuring specific characters. Recent advancements have introduced various control signals to animate static characters, successfully enhancing control over the generation process. However, these methods often lack flexibility, which limits their applicability and makes it challenging for users to synthesize a source character into a desired target scene. To address this issue, we propose a novel framework, AnyCharV, that flexibly generates character videos from arbitrary source characters and target scenes, guided by pose information. Our approach involves a two-stage training process. In the first stage, we develop a base model capable of integrating the source character with the target scene using pose guidance. The second stage further bootstraps controllable generation through a self-boosting mechanism: we train on the videos generated by the first-stage model and replace the fine mask with a coarse one, which leads to better preservation of character details. Extensive experimental results demonstrate the superiority of our method over previous state-of-the-art methods.
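To illustrate the fine-to-coarse idea, the sketch below derives a coarse mask from a fine one. The abstract does not define either mask, so this assumes (purely for illustration) that the fine mask is a per-pixel character silhouette and the coarse mask is its bounding box. The function name `coarse_mask_from_fine` and the list-of-lists mask format are hypothetical and not taken from the paper.

```python
from typing import List

Mask = List[List[int]]  # 1 = character pixel, 0 = background (assumed format)


def coarse_mask_from_fine(fine: Mask) -> Mask:
    """Return the bounding-box mask that covers every nonzero pixel of `fine`.

    Assumption: "coarse" means the axis-aligned bounding box of the fine
    silhouette. The paper may define the coarse mask differently.
    """
    height, width = len(fine), len(fine[0])
    coarse = [[0] * width for _ in range(height)]

    # Rows and columns that contain at least one character pixel.
    rows = [r for r in range(height) if any(fine[r])]
    cols = [c for c in range(width) if any(fine[r][c] for r in range(height))]
    if not rows:
        return coarse  # empty fine mask gives an empty coarse mask

    # Fill the whole box spanned by the first and last occupied row/column.
    for r in range(rows[0], rows[-1] + 1):
        for c in range(cols[0], cols[-1] + 1):
            coarse[r][c] = 1
    return coarse


# A small plus-shaped silhouette as a stand-in for a character region.
fine = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
coarse = coarse_mask_from_fine(fine)
```

Under this assumption the coarse mask always contains the fine mask but reveals less about the exact outline of the character, so a model conditioned on it has to recover the silhouette from the source character rather than from the mask.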