SD | LG | AS | Jul 27, 2022

SoundChoice: Grapheme-to-Phoneme Models with Semantic Disambiguation

arXiv:2207.13703v1 | 23 citations | h-index: 31
Originality: Incremental advance
AI Analysis

This paper addresses pronunciation accuracy in text-to-speech systems, particularly for homographs. The contribution appears incremental because it builds on existing G2P and speech recognition techniques.

The paper tackles the problem of pronunciation disambiguation in speech synthesis by proposing SoundChoice, a novel grapheme-to-phoneme (G2P) architecture that processes entire sentences instead of individual words, achieving a Phoneme Error Rate (PER) of 2.65% on whole-sentence transcription using LibriSpeech and Wikipedia data.
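To make the reported metric concrete, here is a minimal sketch of how a Phoneme Error Rate is commonly computed: the edit distance between predicted and reference phoneme sequences, divided by the total number of reference phonemes. This is the standard definition and is an assumption here; the page does not describe the paper's exact evaluation protocol. The function names and the ARPAbet-style symbols are illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference phoneme
                curr[j - 1] + 1,     # insertion of an extra phoneme
                prev[j - 1] + cost,  # substitution (or match when cost is 0)
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs, hyps):
    """Total edit errors divided by total reference phonemes (assumed definition)."""
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_phonemes = sum(len(r) for r in refs)
    return total_errors / total_phonemes


# Hypothetical example: the homograph "lead" predicted with the wrong vowel.
refs = [["L", "IY", "D"], ["R", "EH", "D"]]
hyps = [["L", "EH", "D"], ["R", "EH", "D"]]
per = phoneme_error_rate(refs, hyps)
```

In this toy example a single wrong vowel in six reference phonemes gives a PER of 1/6, which shows how one homograph mistake can dominate the error on a short sentence.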

End-to-end speech synthesis models directly convert the input characters into an audio representation (e.g., spectrograms). Despite their impressive performance, such models have difficulty disambiguating the pronunciations of identically spelled words. To mitigate this issue, a separate Grapheme-to-Phoneme (G2P) model can be employed to convert the characters into phonemes before synthesizing the audio. This paper proposes SoundChoice, a novel G2P architecture that processes entire sentences rather than operating at the word level. The proposed architecture takes advantage of a weighted homograph loss (that improves disambiguation), exploits curriculum learning (that gradually switches from word-level to sentence-level G2P), and integrates word embeddings from BERT (for further performance improvement). Moreover, the model inherits the best practices in speech recognition, including multi-task learning with Connectionist Temporal Classification (CTC) and beam search with an embedded language model. As a result, SoundChoice achieves a Phoneme Error Rate (PER) of 2.65% on whole-sentence transcription using data from LibriSpeech and Wikipedia.

Index Terms: grapheme-to-phoneme, speech synthesis, text-to-speech, phonetics, pronunciation, disambiguation.
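The abstract names a weighted homograph loss but this page does not give its formula. The sketch below shows one plausible reading, stated purely as an assumption: a per-phoneme negative log-likelihood in which positions belonging to a homograph receive a larger weight, so that errors on ambiguous words cost more. The function name, the `homograph_weight` parameter, and the weighted-mean normalization are all hypothetical and may differ from the paper's actual loss.

```python
import math


def weighted_phoneme_nll(step_probs, targets, in_homograph, homograph_weight=2.0):
    """Weighted mean negative log-likelihood over phoneme positions.

    step_probs:    one dict per output position, mapping phoneme -> probability
    targets:       the reference phoneme at each position
    in_homograph:  True where the position belongs to a homograph (assumed given)
    """
    total, weight_sum = 0.0, 0.0
    for probs, target, flagged in zip(step_probs, targets, in_homograph):
        # Assumption: homograph positions are simply up-weighted.
        weight = homograph_weight if flagged else 1.0
        total += -weight * math.log(probs[target])
        weight_sum += weight
    return total / weight_sum


# Hypothetical example: the model is confident on consonants, unsure on the vowel.
step_probs = [{"L": 0.9}, {"IY": 0.5, "EH": 0.5}, {"D": 0.9}]
targets = ["L", "IY", "D"]
in_homograph = [False, True, False]

plain_loss = weighted_phoneme_nll(step_probs, targets, in_homograph, homograph_weight=1.0)
weighted_loss = weighted_phoneme_nll(step_probs, targets, in_homograph, homograph_weight=3.0)
```

With a weight of 1.0 the function reduces to the ordinary mean negative log-likelihood; raising the weight increases the loss whenever the uncertain position is the flagged one, which is the intended pressure toward better disambiguation under this assumed formulation.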

Code Implementations: 1 repo
