An Investigation of the Relation Between Grapheme Embeddings and Pronunciation for Tacotron-based Systems
This work examines which representations end-to-end TTS models learn, a question of interest to both researchers and practitioners. Its contribution is incremental, since it builds on existing Tacotron frameworks.
The study investigates the relationship between grapheme embeddings and pronunciation in Tacotron-based text-to-speech systems trained on French graphemes. It finds that these embeddings capture phoneme information even though no phoneme information is provided during training, which enables applications such as grapheme-to-phoneme conversion and pronunciation control.
End-to-end models, particularly Tacotron-based ones, are currently a popular solution for text-to-speech synthesis. They allow the production of high-quality synthesized speech with little to no text preprocessing, and they can be trained directly on either grapheme or phoneme input. However, in the case of grapheme inputs, little is known about the relation between the underlying representations learned by the model and word pronunciations. This work investigates this relation in the case of a Tacotron model trained on French graphemes. Our analysis shows that grapheme embeddings are related to phoneme information even though no such information is present during training. Thanks to this property, we show that grapheme embeddings learned by Tacotron models can be useful for tasks such as grapheme-to-phoneme conversion and pronunciation control in synthetic speech.
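As an illustration of what "embeddings related to phoneme information" can mean in practice, the following is a minimal sketch of a nearest-centroid probe. It is not the method used in this work. It assumes that one embedding vector per grapheme has already been extracted from a trained model and paired with a phoneme label; the toy 2-D vectors, the labels `a`, `o`, `s`, and the helper names `fit_centroids` and `predict` are all hypothetical.

```python
import math
import random


def fit_centroids(embeddings, labels):
    """Average the embedding vectors that share a phoneme label."""
    sums, counts = {}, {}
    for vec, lab in zip(embeddings, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [x / counts[lab] for x in acc] for lab, acc in sums.items()}


def predict(vec, centroids):
    """Return the phoneme label whose centroid is closest (Euclidean)."""
    return min(centroids, key=lambda lab: math.dist(vec, centroids[lab]))


# Toy stand-in for real grapheme embeddings (assumption: real vectors would
# come from a trained model; these are synthetic, well-separated clusters).
rng = random.Random(0)
toy_centers = {"a": (1.0, 0.0), "o": (0.0, 1.0), "s": (-1.0, -1.0)}
labels = [lab for lab in toy_centers for _ in range(20)]
embeddings = [
    [c + rng.gauss(0.0, 0.05) for c in toy_centers[lab]] for lab in labels
]

centroids = fit_centroids(embeddings, labels)
predicted = [predict(vec, centroids) for vec in embeddings]
accuracy = sum(p == t for p, t in zip(predicted, labels)) / len(labels)
print(f"probe accuracy on toy data: {accuracy:.2f}")
```

If a probe this simple recovers phoneme labels from real embeddings well above chance on held-out data, that would be consistent with the embeddings carrying pronunciation information. The toy data here is separable by construction and says nothing about actual results.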