OLaPh: Optimal Language Phonemizer
The paper addresses phonemization accuracy for text-to-speech systems, particularly for out-of-vocabulary words, though the contribution appears incremental because it builds on existing methods.
The paper tackles phonemization (converting text to phonemes) for text-to-speech, focusing on challenging cases such as names and loanwords. It introduces OLaPh, a framework that combines lexica, NLP techniques, and probabilistic scoring. Evaluations in German and English show improved accuracy, with further gains from an LLM trained on OLaPh-generated data.
Phonemization, the conversion of text into phonemes, is a key step in text-to-speech. Traditional approaches use rule-based transformations and lexicon lookups, while more advanced methods apply preprocessing techniques or neural networks for improved accuracy on out-of-domain vocabulary. However, all systems struggle with names, loanwords, abbreviations, and homographs. This work presents OLaPh (Optimal Language Phonemizer), a framework that combines large lexica, multiple NLP techniques, and compound resolution with a probabilistic scoring function. Evaluations in German and English show improved accuracy over previous approaches, including on a challenging dataset. To further address unresolved cases, we train a large language model on OLaPh-generated data, which achieves even stronger generalization and performance. Together, the framework and LLM improve phonemization consistency and provide a freely available resource for future research.
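To make the idea of lexicon lookup with compound resolution and probabilistic scoring concrete, the following is a minimal sketch and not OLaPh's actual implementation. The abstract does not specify the scoring function, so everything here is an assumption made for illustration: the toy lexicon, its probabilities, the ARPAbet-style transcriptions (stress marks omitted), the per-part penalty, and the names LEXICON, SPLIT_PENALTY, and phonemize.

```python
import math
from functools import lru_cache

# Hypothetical toy lexicon: word -> (phonemes, probability).
# All entries and probabilities are invented for illustration.
LEXICON = {
    "sun": ("S AH N", 0.9),
    "flower": ("F L AW ER", 0.8),
    "flow": ("F L OW", 0.7),
    "er": ("ER", 0.2),
}

# Assumed penalty per compound part, so splits with fewer parts are preferred.
SPLIT_PENALTY = math.log(0.5)


def phonemize(word):
    """Return a list of phoneme strings (one per part), or None if unresolved."""
    word = word.lower()

    # Step 1: direct lexicon lookup.
    if word in LEXICON:
        return [LEXICON[word][0]]

    # Step 2: compound resolution. Try every segmentation into lexicon
    # entries and keep the one with the highest log-probability score.
    @lru_cache(maxsize=None)
    def best(start):
        if start == len(word):
            return (0.0, ())
        candidates = []
        for end in range(start + 1, len(word) + 1):
            entry = LEXICON.get(word[start:end])
            if entry is None:
                continue
            rest = best(end)
            if rest is None:
                continue
            phonemes, prob = entry
            score = math.log(prob) + SPLIT_PENALTY + rest[0]
            candidates.append((score, (phonemes,) + rest[1]))
        return max(candidates) if candidates else None

    result = best(0)
    return None if result is None else list(result[1])


if __name__ == "__main__":
    print(phonemize("sunflower"))
    print(phonemize("xyz"))
```

In this sketch, "sunflower" can be segmented as "sun" + "flower" or as "sun" + "flow" + "er"; the first wins because it has fewer parts and higher-probability entries. A word that no segmentation covers returns None, which corresponds to the unresolved cases the abstract says are handed to the trained language model.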