CV CG Oct 17, 2023

CorrTalk: Correlation Between Hierarchical Speech and Facial Activity Variances for 3D Animation

Zhaojie Chu, Kailing Guo, Xiaofen Xing, Yilin Lan, Bolun Cai, Xiangmin Xu

arXiv:2310.11295v18.413 citations h-index: 25

Originality: Incremental advance

AI Analysis

This work improves 3D facial animation for applications like virtual avatars or gaming by providing more realistic and varied facial movements, though the contribution is an incremental refinement of existing approaches.

The paper tackles speech-driven 3D facial animation, addressing an oversimplification in existing methods: they map single-level speech features to the entire face, which leads to overly smoothed animations. According to the authors' experiments and user study, the proposed CorrTalk framework outperforms state-of-the-art methods.

Speech-driven 3D facial animation is a challenging cross-modal task that has attracted growing research interest. During speech, the mouth displays strong motions, while the other facial regions typically show comparatively weak activity. Existing approaches often simplify the process by directly mapping single-level speech features to the entire facial animation, which overlooks the differences in facial activity intensity and leads to overly smoothed facial movements. In this study, we propose a novel framework, CorrTalk, which effectively establishes the temporal correlation between hierarchical speech features and facial activities of different intensities across distinct regions. A novel facial activity intensity metric is defined to distinguish between strong and weak facial activity, obtained by computing the short-time Fourier transform of facial vertex displacements. Based on the variances in facial activity, we propose a dual-branch decoding framework to synchronously synthesize strong and weak facial activity, which guarantees facial animation synthesis across a wider range of intensities. Furthermore, a weighted hierarchical feature encoder is proposed to establish temporal correlation between hierarchical speech features and facial activity at different intensities, which ensures lip-sync and plausible facial expressions. Extensive qualitative and quantitative experiments, as well as a user study, indicate that our CorrTalk outperforms existing state-of-the-art methods. The source code and supplementary video are publicly available at: https://zjchu.github.io/projects/CorrTalk/
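The intensity metric described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract only states that the metric comes from the short-time Fourier transform of facial vertex displacements. Everything else here is an assumption made for illustration, including the function names (`stft_magnitude`, `activity_intensity`, `split_strong_weak`), measuring displacement relative to a neutral template, averaging STFT magnitudes into one score per vertex, and using a quantile threshold to separate strong from weak vertices.

```python
import numpy as np


def stft_magnitude(signal, win_len=16, hop=8):
    """Magnitude STFT of a 1-D signal with a Hann window (plain NumPy)."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop
    frames = np.stack([
        signal[i * hop: i * hop + win_len] * window
        for i in range(n_frames)
    ])
    # Shape: (n_frames, win_len // 2 + 1)
    return np.abs(np.fft.rfft(frames, axis=1))


def activity_intensity(vertices, template, win_len=16, hop=8):
    """Hypothetical per-vertex intensity score.

    vertices: (T, V, 3) animated mesh sequence.
    template: (V, 3) neutral face (assumed reference for displacement).
    Returns a (V,) array: mean STFT magnitude of each vertex's
    displacement-over-time signal (an assumed aggregation).
    """
    disp = np.linalg.norm(vertices - template[None], axis=-1)  # (T, V)
    return np.array([
        stft_magnitude(disp[:, v], win_len, hop).mean()
        for v in range(disp.shape[1])
    ])


def split_strong_weak(intensity, quantile=0.75):
    """Boolean mask of 'strong' vertices; the quantile threshold is assumed."""
    return intensity > np.quantile(intensity, quantile)
```

Under these assumptions, mouth-region vertices with large oscillating displacements would receive high scores and fall into the strong group, while the remaining vertices fall into the weak group. A mask of this kind could be used to route vertices to the two decoding branches the abstract mentions.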
