CV · May 11

Improving Human Image Animation via Semantic Representation Alignment

Chang Liu, Mengting Chen, Yixuan Huang, Haoning Wu, Chen Ju, Shuai Xiao, Jinsong Lan, Yanfeng Wang

arXiv:2605.10523 · 87.8

Predicted impact: top 18% in CV · last 90 days
Originality: Incremental advance

AI Analysis

For researchers in image-to-video generation, this work addresses persistent artifacts in human animation, offering a novel alignment approach that enhances coherence without sacrificing flexibility.

SemanticREPA improves human image animation by aligning semantic representations (structure and ID) with depth and face recognition features, reducing limb twisting and facial distortion. It achieves superior quality on extended motions and character consistency.

The field of image-to-video generation has made remarkable progress. However, challenges such as human limb twisting and facial distortion persist, especially when generating long videos or modeling intensive motions. Existing human image animation works address these issues by incorporating human-specific semantic representations, e.g., dense poses or ID embeddings, as additional conditions. However, conditioning on these representations can reduce generation flexibility. Moreover, their reliance on RGB pixel supervision places little emphasis on learning the necessary 3D geometric relationships and temporal coherence. In contrast, we introduce a novel approach named SemanticREPA that leverages these semantic representations as supervision signals through representation alignment. Specifically, we begin by training a structure alignment module that aligns the structure representations obtained from video latents with video depth estimation features. We then freeze the pretrained module and use it to provide additional supervision on the structure representations of the diffusion model, achieving structure rectification that yields coherent and stable human structures. Simultaneously, we develop an ID alignment module to align the ID representations of the generated videos with face recognition features. We further propose to use the predicted structure representations to refine identity restoration in relevant regions. With structure and ID alignment, our method demonstrates superior quality on extended character motions and enhanced character consistency.
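The abstract does not give the form of the alignment objective, so the following is only a minimal sketch of a generic representation-alignment loss, assuming a cosine-similarity objective between projected diffusion features and frozen target features (depth features for structure, face recognition features for ID). The function names, the linear projection head, and the weights `lambda_structure` and `lambda_id` are all hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (max(norm_u, eps) * max(norm_v, eps))


def project(token, weight):
    """Hypothetical linear projection head; `weight` is a list of rows."""
    return [sum(w * x for w, x in zip(row, token)) for row in weight]


def alignment_loss(hidden_tokens, target_features, weight):
    """Mean (1 - cosine similarity) between projected tokens and targets.

    hidden_tokens: per-token features taken from the generative model.
    target_features: per-token features from a frozen encoder
        (assumed here: a depth estimator or a face recognition model).
    """
    if not hidden_tokens or len(hidden_tokens) != len(target_features):
        raise ValueError("need the same non-zero number of tokens and targets")
    total = 0.0
    for token, target in zip(hidden_tokens, target_features):
        total += 1.0 - cosine_similarity(project(token, weight), target)
    return total / len(hidden_tokens)


def total_loss(diffusion_loss, structure_loss, id_loss,
               lambda_structure=0.5, lambda_id=0.5):
    """Diffusion loss plus weighted alignment terms (weights are assumptions)."""
    return diffusion_loss + lambda_structure * structure_loss + lambda_id * id_loss
```

In this sketch the loss is 0 when every projected token points in the same direction as its target and 2 when it points in the opposite direction, so minimizing it pulls the model's internal features toward the frozen encoder's features without using them as input conditions.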
