CV | Mar 28, 2025

Follow Your Motion: A Generic Temporal Consistency Portrait Editing Framework with Trajectory Guidance

Haijie Yang, Zhenyu Zhang, Hao Tang, Jianjun Qian, Jian Yang

arXiv:2503.22225v1 | 1 citation | h-index: 6

Originality: Incremental advance

AI Analysis

This work addresses temporal consistency issues in portrait editing for applications such as video generation and virtual avatars, representing an incremental improvement over existing methods.

The paper tackles temporal inconsistency in portrait editing, particularly for talking heads. It introduces the Follow Your Motion (FYM) framework, which learns motion trajectories and uses a dynamic re-weighted attention mechanism to improve temporal consistency in applications such as text-driven editing and relighting.

Pre-trained conditional diffusion models have demonstrated remarkable potential in image editing. However, they often struggle with temporal consistency, particularly in the talking head domain, where continuous changes in facial expression make the problem harder. These issues stem from editing each image independently, which loses temporal continuity during the editing process. In this paper, we introduce Follow Your Motion (FYM), a generic framework for maintaining temporal consistency in portrait editing. Specifically, given portrait images rendered by a pre-trained 3D Gaussian Splatting model, we first develop a diffusion model that intuitively and inherently learns motion trajectory changes at different scales and pixel coordinates, from the first frame to each subsequent frame. This approach ensures that temporally inconsistent edited avatars inherit the motion information from the rendered avatars. Second, to maintain fine-grained temporal consistency of expressions in talking head editing, we propose a dynamic re-weighted attention mechanism. This mechanism assigns higher weight coefficients to landmark points in space and dynamically updates these weights based on a landmark loss, achieving more consistent and refined facial expressions. Extensive experiments demonstrate that our method outperforms existing approaches in terms of temporal consistency and can be used to optimize and compensate for temporally inconsistent outputs in a range of applications, such as text-driven editing and relighting.
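
To make the re-weighting idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes attention over a flat list of spatial tokens, a per-token weight that is boosted at landmark positions, and a simple additive update of that boost driven by a landmark loss. All function names (`landmark_weights`, `reweighted_attention`, `landmark_loss`, `update_boost`), the update rule, and the `rate` and `max_boost` parameters are hypothetical and chosen only for illustration; the abstract does not specify them.

```python
import math


def landmark_weights(num_tokens, landmark_idx, boost):
    """Per-token weights: 1.0 everywhere, 1.0 + boost at landmark tokens."""
    weights = [1.0] * num_tokens
    for i in landmark_idx:
        weights[i] = 1.0 + boost
    return weights


def reweighted_attention(scores, weights):
    """Softmax over key tokens, with each key scaled by its weight
    and the result renormalised to sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    scaled = [w * math.exp(s - m) for s, w in zip(scores, weights)]
    total = sum(scaled)
    return [v / total for v in scaled]


def landmark_loss(pred, target):
    """Mean squared distance between predicted and reference 2D landmarks."""
    sq = [(px - tx) ** 2 + (py - ty) ** 2
          for (px, py), (tx, ty) in zip(pred, target)]
    return sum(sq) / len(sq)


def update_boost(boost, loss, rate=0.5, max_boost=4.0):
    """Assumed update rule: a larger landmark loss raises the boost, up to a cap."""
    return min(max_boost, boost + rate * loss)


# Four tokens with equal raw scores; token 1 is a landmark.
weights = landmark_weights(4, [1], boost=1.0)
attn = reweighted_attention([0.0, 0.0, 0.0, 0.0], weights)

# One of two landmarks is off by 2 pixels vertically, so the boost grows.
loss = landmark_loss([(0, 0), (1, 1)], [(0, 0), (1, 3)])
new_boost = update_boost(1.0, loss)
```

In this toy setup the landmark token receives twice the attention of each other token, and a non-zero landmark loss increases the boost used on the next step. A real system would apply the weights inside the diffusion model's attention layers and learn or schedule the update, which this sketch does not attempt.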
