SD, CV, LG, AS, IV | Feb 18, 2021

AudioVisual Speech Synthesis: A brief literature review

arXiv:2103.03927v1
Originality: Synthesis-oriented
AI Analysis

The review provides a structured overview for researchers in speech and animation, but its contribution is incremental: it surveys existing work without presenting new findings.

This literature review examines audiovisual speech synthesis, which generates animated talking heads from text, by decomposing it into text-to-speech synthesis and voice-driven animation components, summarizing key methods and their trade-offs.

This brief literature review studies audiovisual speech synthesis: the problem of generating an animated talking head from input text. Because the problem is highly complex, we treat it as the composition of two subproblems: Text-to-Speech (TTS) synthesis and voice-driven talking-head animation. For TTS, we present models that map text to intermediate acoustic representations, e.g., mel-spectrograms, as well as models that generate voice signals conditioned on these intermediate representations, i.e., vocoders. For talking-head animation, we categorize approaches by whether they produce human faces or anthropomorphic figures, and for the latter we discuss the importance of the choice of facial model. Throughout the review, we briefly describe the most important work in audiovisual speech synthesis and highlight the advantages and disadvantages of the various approaches.
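The decomposition described in the abstract can be pictured as three chained stages: an acoustic model (text to mel-spectrogram), a vocoder (mel-spectrogram to waveform), and a voice-driven animator (waveform to per-frame facial parameters). The following is only a minimal sketch of that composition, not any method from the review. Every name in it (`AudioVisualSynthesizer`, the `toy_*` stand-ins, and the constants `N_MELS`, `HOP`, `SAMPLES_PER_VIDEO_FRAME`, `N_FACE`) is hypothetical and chosen for illustration; the toy stages emit zeros and only demonstrate how the shapes flow through the pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Illustrative type aliases (assumptions, not taken from the review).
MelSpectrogram = List[List[float]]  # acoustic frames x mel bins
Waveform = List[float]              # audio samples
FaceParams = List[List[float]]      # video frames x facial parameters

# Hypothetical sizes, chosen only to make the sketch concrete.
N_MELS = 80                    # mel bins per acoustic frame
HOP = 256                      # audio samples produced per acoustic frame
SAMPLES_PER_VIDEO_FRAME = 512  # audio samples consumed per video frame
N_FACE = 10                    # facial parameters per video frame


@dataclass
class AudioVisualSynthesizer:
    """Composes TTS (acoustic model + vocoder) with voice-driven animation."""
    acoustic_model: Callable[[str], MelSpectrogram]
    vocoder: Callable[[MelSpectrogram], Waveform]
    animator: Callable[[Waveform], FaceParams]

    def synthesize(self, text: str) -> Tuple[Waveform, FaceParams]:
        mel = self.acoustic_model(text)  # text -> intermediate representation
        audio = self.vocoder(mel)        # intermediate representation -> voice
        frames = self.animator(audio)    # voice -> talking-head parameters
        return audio, frames


def toy_acoustic_model(text: str) -> MelSpectrogram:
    # Stand-in: one all-zero mel frame per input character.
    return [[0.0] * N_MELS for _ in text]


def toy_vocoder(mel: MelSpectrogram) -> Waveform:
    # Stand-in: HOP silent samples per mel frame.
    return [0.0] * (len(mel) * HOP)


def toy_animator(audio: Waveform) -> FaceParams:
    # Stand-in: one neutral parameter vector per full chunk of audio.
    n_frames = len(audio) // SAMPLES_PER_VIDEO_FRAME
    return [[0.0] * N_FACE for _ in range(n_frames)]


pipeline = AudioVisualSynthesizer(toy_acoustic_model, toy_vocoder, toy_animator)
audio, frames = pipeline.synthesize("hello")
print(len(audio), len(frames))
```

Keeping the three stages behind plain callables reflects the modular view taken in the abstract: under this assumption, any text-to-mel model, vocoder, or animation model could be swapped in without touching the other two.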
