CV · GR · LG · Dec 1, 2024

Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Video Diffusion Transformer

arXiv:2412.00733v4 · 116 citations · h-index: 8 · CVPR
Originality: Highly original
AI Analysis

This work addresses the problem of creating immersive, dynamic portrait animations for media and entertainment applications, and proposes a novel method for a known bottleneck.

The paper tackles the challenge of animating portrait images with non-frontal perspectives, dynamic objects, and realistic backgrounds by introducing a transformer-based video generative model, achieving substantial improvements in generating realistic videos on benchmark and wild datasets.

Existing methodologies for animating portrait images face significant challenges, particularly in handling non-frontal perspectives, rendering dynamic objects around the portrait, and generating immersive, realistic backgrounds. In this paper, we introduce the first application of a pretrained transformer-based video generative model that demonstrates strong generalization capabilities and generates highly dynamic, realistic videos for portrait animation, effectively addressing these challenges. The adoption of a new video backbone model makes previous U-Net-based methods for identity maintenance, audio conditioning, and video extrapolation inapplicable. To address this limitation, we design an identity reference network consisting of a causal 3D VAE combined with a stacked series of transformer layers, ensuring consistent facial identity across video sequences. Additionally, we investigate various speech audio conditioning and motion frame mechanisms to enable the generation of continuous video driven by speech audio. Our method is validated through experiments on benchmark and newly proposed wild datasets, demonstrating substantial improvements over prior methods in generating realistic portraits characterized by diverse orientations within dynamic and immersive scenes. Further visualizations and the source code are available at: https://fudan-generative-vision.github.io/hallo3/.
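The abstract mentions motion frame mechanisms for generating continuous video but does not spell out the scheme. As a rough illustration only, the sketch below assumes the common clip-by-clip approach in which each new clip is conditioned on the last few frames of the previous one. The function name `plan_clips` and its parameters are hypothetical and are not taken from the Hallo3 code; the actual conditioning mechanism studied in the paper may differ.

```python
from typing import Dict, List


def plan_clips(total_frames: int, new_per_clip: int, motion_frames: int) -> List[Dict[str, List[int]]]:
    """Plan clip-by-clip generation of a long video (illustrative assumption).

    Each clip produces up to `new_per_clip` new frames. Every clip after the
    first is conditioned on the last `motion_frames` frames generated so far,
    which is what keeps motion continuous across clip boundaries.
    """
    if total_frames <= 0 or new_per_clip <= 0 or motion_frames < 0:
        raise ValueError("total_frames and new_per_clip must be positive; motion_frames non-negative")

    clips = []
    start = 0
    while start < total_frames:
        end = min(start + new_per_clip, total_frames)
        # Frames already generated that condition this clip (empty for the first clip).
        context = list(range(max(0, start - motion_frames), start))
        clips.append({"context": context, "new": list(range(start, end))})
        start = end
    return clips


if __name__ == "__main__":
    for clip in plan_clips(total_frames=10, new_per_clip=4, motion_frames=2):
        print(clip)
```

In a real pipeline the context frames would typically be encoded (for example by the VAE) and passed to the generator alongside the audio features; here they are only frame indices, to show the bookkeeping.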

Code Implementations: 1 repo
