CVLGNISI Sep 29, 2022

Facial Landmark Predictions with Applications to Metaverse

arXiv:2209.14698v1
h-index: 11
Originality: Synthesis-oriented
AI Analysis

This work addresses realism in metaverse applications through automated lip-syncing. It is an incremental improvement that adapts existing methods to a new domain.

This research tackled the problem of making metaverse characters more realistic by generating lip animations from speech. It extends Tacotron 2 to predict lip landmark displacements in one pass; training converged in 7 hours using less than 5 minutes of video data.

This research aims to make metaverse characters more realistic by adding lip animations learnt from videos in the wild. To achieve this, we extend the Tacotron 2 text-to-speech synthesizer to generate lip movements together with the mel spectrogram in one pass. The encoder and gate layer weights are pre-trained on the LJ Speech 1.1 data set, while the decoder is retrained on 93 clips of TED talk videos extracted from the LRS 3 data set. Our novel decoder predicts the displacement of 20 lip landmark positions across time, using labels automatically extracted by the OpenFace 2.0 landmark predictor. Training converged in 7 hours using less than 5 minutes of video. We conducted an ablation study on the Pre/Post-Net and the pre-trained encoder weights to demonstrate the effectiveness of transfer learning between audio and visual speech data.
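To make the label-extraction step concrete, here is a minimal sketch of how per-frame displacement targets for the 20 lip landmarks could be derived from 68-point landmark tracks such as those OpenFace 2.0 produces. It is an illustration under stated assumptions, not the authors' pipeline: the function name `lip_displacements` is hypothetical, the mouth is taken to be indices 48 to 67 (0-based) of the standard 68-point scheme, and displacement is measured against the first frame of the clip, which the abstract does not specify.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Frame = List[Point]  # 68 (x, y) landmarks for one video frame

# In the standard 68-point scheme the mouth occupies indices 48..67
# (0-based), i.e. 20 points, matching the count in the abstract.
LIP_START, LIP_END = 48, 68


def lip_displacements(frames: List[Frame]) -> List[List[Point]]:
    """Return, per frame, the (dx, dy) offset of each lip landmark.

    ASSUMPTION: offsets are taken relative to the first frame of the clip.
    The paper may use a different reference (e.g. the previous frame).
    """
    if not frames:
        return []
    for frame in frames:
        if len(frame) != 68:
            raise ValueError("expected 68 landmarks per frame")

    reference = frames[0][LIP_START:LIP_END]
    displacements = []
    for frame in frames:
        lips = frame[LIP_START:LIP_END]
        displacements.append(
            [(x - rx, y - ry) for (x, y), (rx, ry) in zip(lips, reference)]
        )
    return displacements


# Toy input: two frames in which only the lip points move by one pixel in x.
demo_frames = [
    [(float(i), float(i)) for i in range(68)],
    [(float(i) + (1.0 if i >= LIP_START else 0.0), float(i)) for i in range(68)],
]
demo_targets = lip_displacements(demo_frames)
```

Each entry of `demo_targets` holds 20 offset pairs, which is the shape a decoder predicting lip movement per time step would regress against. Head-pose normalisation and scaling are omitted here and would likely matter for in-the-wild video.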

Code Implementations: 1 repo
