Neural Machine Translation for Multilingual Grapheme-to-Phoneme Conversion
This work addresses the challenge of generating pronunciations for out-of-vocabulary words in ASR systems, with particular benefit for low-resource languages and code-switching scenarios. The contribution is incremental, since it builds on existing neural methods.
The paper tackles grapheme-to-phoneme conversion for multilingual applications by developing a single neural model that shares one encoder and decoder across languages. It reports a 7.2% average improvement in phoneme error rate for low-resource languages without degrading performance on high-resource ones.
Grapheme-to-phoneme (G2P) models are a key component of Automatic Speech Recognition (ASR) systems, such as the ASR system in Alexa, where they generate pronunciations for out-of-vocabulary words that are missing from the pronunciation lexicons (mappings such as "e c h o" to "E k oU"). Most G2P systems are monolingual and based on traditional joint-sequence n-gram models [1,2]. As an alternative, we present a single end-to-end trained neural G2P model that shares the same encoder and decoder across multiple languages. This allows the model to utilize a combination of universal symbol inventories of Latin-like alphabets and cross-linguistically shared feature representations. Such a model is especially useful for low-resource languages and for code-switching or foreign words, where the pronunciations in one language need to be adapted to other locales or accents. We further experiment with a word language distribution vector as an additional training target, which is intended to improve system performance by helping the model decouple pronunciations across a variety of languages in the parameter space. We show a 7.2% average improvement in phoneme error rate over low-resource languages and no degradation over high-resource ones compared to monolingual baselines.
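The abstract reports results in phoneme error rate but does not define the metric. A minimal sketch follows, assuming the common definition: the token-level Levenshtein distance between predicted and reference phoneme sequences, summed over the test set and divided by the total number of reference phonemes. The function names `edit_distance` and `phoneme_error_rate` and the sample hypothesis "E k o" are illustrative assumptions, not taken from the paper.

```python
from typing import Iterable, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    # prev[j] holds the distance between the first i-1 reference tokens
    # and the first j hypothesis tokens.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference token
                curr[j - 1] + 1,     # insert a hypothesis token
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(pairs: Iterable[Tuple[str, str]]) -> float:
    """Corpus-level PER over (reference, hypothesis) pairs.

    Assumption: pronunciations are space-separated phoneme strings,
    as in the lexicon example "E k oU".
    """
    total_errors = 0
    total_ref_phonemes = 0
    for ref, hyp in pairs:
        ref_tokens, hyp_tokens = ref.split(), hyp.split()
        total_errors += edit_distance(ref_tokens, hyp_tokens)
        total_ref_phonemes += len(ref_tokens)
    if total_ref_phonemes == 0:
        raise ValueError("no reference phonemes to score against")
    return total_errors / total_ref_phonemes


# Hypothetical test set: one correct prediction, one with a substitution.
pairs = [("E k oU", "E k oU"), ("E k oU", "E k o")]
per = phoneme_error_rate(pairs)  # 1 error over 6 reference phonemes
```

Under this definition, a per-language score can be obtained by calling `phoneme_error_rate` on each language's test pairs separately. Whether the paper averages per language or pools all words is not stated in the abstract.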