HC · Dec 30, 2016

Synthesis of Tongue Motion and Acoustics from Text using a Multimodal Articulatory Database

Ingmar Steiner, Sébastien Le Maguer, Alexander Hewer

arXiv:1612.09352v4 · 5 citations

Originality: Incremental advance

AI Analysis

This work enables adding an articulatory modality to conventional text-to-speech applications without extra data, benefiting speech synthesis and related fields.

The researchers tackled the problem of generating synchronized tongue motion and audio from text. They adapted a 3D tongue model to an articulatory dataset and trained a statistical parametric speech synthesis system on the tongue model parameters, achieving a global mean Euclidean distance of less than 2.8 mm between predicted and reference articulatory movements.

We present an end-to-end text-to-speech (TTS) synthesis system that generates audio and synchronized tongue motion directly from text. This is achieved by adapting a 3D model of the tongue surface to an articulatory dataset and training a statistical parametric speech synthesis system directly on the tongue model parameters. We evaluate the model at every step by comparing the spatial coordinates of predicted articulatory movements against the reference data. The results indicate a global mean Euclidean distance of less than 2.8 mm, and our approach can be adapted to add an articulatory modality to conventional TTS applications without the need for extra data.
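
The evaluation metric named in the abstract, a global mean Euclidean distance between predicted and reference coordinates, can be illustrated with a minimal sketch. The function name, the array layout (frames × points × 3 spatial coordinates), and the use of NumPy are assumptions made for illustration; the abstract does not describe the authors' actual evaluation code.

```python
import numpy as np

def mean_euclidean_distance(predicted, reference):
    """Mean Euclidean distance between matching 3D points.

    Both inputs are assumed (hypothetically) to have shape
    (n_frames, n_points, 3), with coordinates in millimetres.
    """
    predicted = np.asarray(predicted, dtype=float)
    reference = np.asarray(reference, dtype=float)
    if predicted.shape != reference.shape:
        raise ValueError("predicted and reference must have the same shape")
    # Per-point distance: norm over the last (x, y, z) axis.
    distances = np.linalg.norm(predicted - reference, axis=-1)
    # "Global" mean: average over all frames and all points.
    return float(distances.mean())

# Toy data: every predicted point is offset by (3, 4, 0) mm from its reference.
reference = np.zeros((2, 2, 3))
predicted = reference + np.array([3.0, 4.0, 0.0])
error_mm = mean_euclidean_distance(predicted, reference)
```

In this toy case every point is displaced by the same vector, so the mean equals the length of that vector. On real data the per-point distances vary, and the mean summarizes them in a single figure comparable to the 2.8 mm reported above.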
