GR · CV · May 3, 2025

OT-Talk: Animating 3D Talking Head with Optimal Transportation

arXiv:2505.01932v2 · 5 citations · h-index: 3 · ICMR
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of creating natural 3D facial animations for AR/VR, gaming, and entertainment applications. It represents an incremental improvement over existing methods.

The paper animates 3D talking heads from audio input. It proposes OT-Talk, which uses optimal transportation and Chebyshev Graph Convolution to improve facial motion learning, and it reports state-of-the-art mesh reconstruction accuracy and temporal alignment on two public datasets.

Animating 3D head meshes using audio inputs has significant applications in AR/VR, gaming, and entertainment through 3D avatars. However, bridging the modality gap between speech signals and facial dynamics remains a challenge, often resulting in incorrect lip syncing and unnatural facial movements. To address this, we propose OT-Talk, the first approach to leverage optimal transportation to optimize the learning model in talking head animation. Building on existing learning frameworks, we utilize a pre-trained HuBERT model to extract audio features and a transformer model to process temporal sequences. Unlike previous methods that focus solely on vertex coordinates or displacements, we introduce Chebyshev Graph Convolution to extract geometric features from triangulated meshes. To measure mesh dissimilarities, we go beyond traditional mesh reconstruction errors and velocity differences between adjacent frames. Instead, we represent meshes as probability measures and approximate their surfaces. This allows us to leverage the sliced Wasserstein distance for modeling mesh variations. This approach facilitates the learning of smooth and accurate facial motions, resulting in coherent and natural facial animations. Our experiments on two public audio-mesh datasets demonstrate that our method outperforms state-of-the-art techniques both quantitatively and qualitatively in terms of mesh reconstruction accuracy and temporal alignment. In addition, we conducted a user perception study with 20 volunteers to further assess the effectiveness of our approach.
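
To make the sliced Wasserstein distance mentioned in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes each mesh is reduced to a uniformly weighted point cloud with the same number of points (for example, the vertices of two meshes that share a topology); the abstract does not say how OT-Talk approximates the mesh surface. The function name `sliced_wasserstein` and the parameters `n_projections`, `p`, and `seed` are hypothetical and chosen for illustration only.

```python
import numpy as np


def sliced_wasserstein(x, y, n_projections=64, p=2, seed=0):
    """Monte Carlo estimate of the sliced p-Wasserstein distance.

    x, y: arrays of shape (n, d) holding two point clouds with the SAME
    number of points, each point carrying equal mass (an assumption of
    this sketch, not something stated in the paper).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("this sketch requires point clouds of equal shape")

    rng = np.random.default_rng(seed)
    d = x.shape[1]

    # Draw random directions and normalise them onto the unit sphere.
    theta = rng.normal(size=(n_projections, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)

    # Project both clouds onto every direction: result is (n, n_projections).
    x_proj = x @ theta.T
    y_proj = y @ theta.T

    # In 1D, optimal transport between equal-size uniform samples
    # matches sorted values, so sorting each column solves each slice.
    x_sorted = np.sort(x_proj, axis=0)
    y_sorted = np.sort(y_proj, axis=0)

    # Average the p-th power cost over points and slices, then take the root.
    return float(np.mean(np.abs(x_sorted - y_sorted) ** p) ** (1.0 / p))


# Toy usage: a random "mesh" of 500 vertices and a shifted copy of it.
rng = np.random.default_rng(1)
verts_a = rng.normal(size=(500, 3))
verts_b = verts_a + np.array([0.1, 0.0, 0.0])
dist = sliced_wasserstein(verts_a, verts_b)
```

Because each slice only needs a sort, the cost grows as O(n log n) per projection, which is the usual reason to prefer the sliced variant over solving a full optimal transport problem between the two point sets.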

