PoseGen: In-Context LoRA Finetuning for Pose-Controllable Long Human Video Generation
PoseGen addresses two limitations of current video generation, identity drift and short clip length, that matter for applications requiring precise human motion control.
The paper tackles the challenge of generating long, temporally coherent videos with precise control over subject identity and motion. It introduces PoseGen, a framework that generates arbitrarily long videos from a single reference image and a pose sequence, and reports significant improvements in identity fidelity and pose accuracy over state-of-the-art methods.
Generating long, temporally coherent videos with precise control over subject identity and motion is a formidable challenge for current diffusion models, which often suffer from identity drift and are limited to short clips. We introduce PoseGen, a novel framework that generates arbitrarily long videos of a specific subject from a single reference image and a driving pose sequence. Our core innovation is an in-context LoRA finetuning strategy that injects subject appearance at the token level for identity preservation, while simultaneously conditioning on pose information at the channel level for fine-grained motion control. To overcome duration limits, PoseGen introduces an interleaved segment generation method that stitches video clips together, using a shared KV cache mechanism and a specialized transition process to ensure background consistency and temporal smoothness. Although PoseGen is trained on a small 33-hour video dataset, extensive experiments show that it significantly outperforms state-of-the-art methods in identity fidelity and pose accuracy, and that it can produce coherent, artifact-free videos of unlimited duration.
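The difference between token-level and channel-level conditioning can be made concrete with a toy example. The sketch below is not PoseGen's implementation; it only illustrates the two ways of attaching a condition to a transformer input sequence. The function name `build_sequence`, the flat-list token representation, and the zero-padding of reference tokens are all assumptions made for illustration.

```python
from typing import List

Token = List[float]  # one latent token: a flat vector of channel values


def build_sequence(
    ref_tokens: List[Token],
    video_tokens: List[Token],
    pose_tokens: List[Token],
) -> List[Token]:
    """Combine the three inputs into one sequence (hypothetical helper).

    Channel level: each video token is widened with the pose features
    that sit at the same position.
    Token level: reference-image tokens are prepended as extra sequence
    entries, so attention can read appearance from them.
    """
    if len(video_tokens) != len(pose_tokens):
        raise ValueError("pose and video must have the same number of tokens")

    pose_dim = len(pose_tokens[0]) if pose_tokens else 0

    # Channel-level conditioning: sequence length unchanged, width grows.
    conditioned = [v + p for v, p in zip(video_tokens, pose_tokens)]

    # Assumption: reference tokens carry no pose, so their pose channels
    # are zero-padded to keep every token the same width.
    padded_ref = [r + [0.0] * pose_dim for r in ref_tokens]

    # Token-level conditioning: sequence length grows, width unchanged.
    return padded_ref + conditioned


if __name__ == "__main__":
    ref = [[1.0] * 4 for _ in range(2)]    # 2 reference tokens, 4 channels
    video = [[0.5] * 4 for _ in range(3)]  # 3 video tokens, 4 channels
    pose = [[0.1] * 2 for _ in range(3)]   # 3 pose tokens, 2 channels
    seq = build_sequence(ref, video, pose)
    print(len(seq), len(seq[0]))
```

In this toy setting the pose signal changes the width of each video token, while the reference image changes the length of the sequence. That matches the abstract's framing: pose must align with each video position, whereas appearance is global context that every position can attend to.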