SD | AI | CV | GR | AS | May 2, 2022

A Novel Speech-Driven Lip-Sync Model with CNN and LSTM

arXiv:2205.00916v1 | 4 citations | h-index: 44
Originality: Synthesis-oriented
AI Analysis

The paper addresses the problem of creating realistic virtual characters for applications such as animation and gaming, but its contribution is incremental because it builds on existing deep learning methods.

The paper tackles generating synchronized lip movements for virtual characters from speech, using a CNN-LSTM model trained on a new Mandarin dataset. Its qualitative and quantitative evaluations report smooth and natural results.

Generating synchronized and natural lip movement with speech is one of the most important tasks in creating realistic virtual characters. In this paper, we present a combined deep neural network of one-dimensional convolutions and LSTM to generate vertex displacement of a 3D template face model from variable-length speech input. The motion of the lower part of the face, which is represented by the vertex movement of 3D lip shapes, is consistent with the input speech. In order to enhance the robustness of the network to different sound signals, we adapt a trained speech recognition model to extract speech features, and a velocity loss term is adopted to reduce the jitter of generated facial animation. We recorded a series of videos of a Chinese adult speaking Mandarin and created a new speech-animation dataset to compensate for the lack of such public data. Qualitative and quantitative evaluations indicate that our model is able to generate smooth and natural lip movements synchronized with speech.
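
The abstract mentions a velocity loss term but does not give its formula. The sketch below shows one common formulation, offered as an assumption rather than the authors' definition: a position term (mean squared error on vertex coordinates) plus a velocity term (mean squared error on frame-to-frame differences). The function names, the flattened per-frame layout, and the `velocity_weight` parameter are all hypothetical.

```python
# Illustrative sketch only: the paper's exact loss is not given in the abstract.
# Each animation is a list of frames; each frame is a flat list of vertex
# coordinates (x, y, z values concatenated). This layout is an assumption.

def position_loss(pred, target):
    """Mean squared error over all frames and coordinates."""
    total = 0.0
    count = 0
    for pred_frame, target_frame in zip(pred, target):
        for p, t in zip(pred_frame, target_frame):
            total += (p - t) ** 2
            count += 1
    return total / count


def frame_differences(frames):
    """Per-coordinate difference between each frame and the one before it."""
    return [
        [curr - prev for prev, curr in zip(f0, f1)]
        for f0, f1 in zip(frames, frames[1:])
    ]


def velocity_loss(pred, target):
    """Mean squared error between predicted and target frame-to-frame motion."""
    if len(pred) < 2 or len(target) < 2:
        raise ValueError("velocity loss needs at least two frames")
    return position_loss(frame_differences(pred), frame_differences(target))


def total_loss(pred, target, velocity_weight=1.0):
    """Position term plus a weighted velocity term (the weight is hypothetical)."""
    return position_loss(pred, target) + velocity_weight * velocity_loss(pred, target)


# One coordinate tracked over four frames.
target = [[0.0], [1.0], [2.0], [3.0]]
smooth = [[0.1], [1.1], [2.1], [3.1]]   # constant offset, no jitter
jittery = [[0.1], [0.9], [2.1], [2.9]]  # same per-frame error, alternating sign

print(position_loss(smooth, target), position_loss(jittery, target))
print(velocity_loss(smooth, target), velocity_loss(jittery, target))
```

In this toy case both predictions have the same position error, so a position-only loss cannot tell them apart. The velocity term is near zero for the smooth prediction and clearly positive for the jittery one, which is the sense in which such a term penalizes jitter.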

