CL / SD / AS, Jan 19, 2024

Data-driven grapheme-to-phoneme representations for a lexicon-free text-to-speech

arXiv:2401.10465v1, 4 citations, ICASSP
Originality: Incremental advance
AI Analysis

The paper addresses the high cost and suboptimal phoneme representations of lexicon-dependent G2P systems for text-to-speech applications. It appears incremental, since it builds on existing self-supervised learning techniques.

The paper tackles grapheme-to-phoneme conversion in text-to-speech systems without hand-crafted lexicons, using self-supervised learning to obtain data-driven phoneme representations. It reports Mean Opinion Scores as good as or marginally better than those of lexicon-based methods.

Grapheme-to-Phoneme (G2P) conversion is an essential first step in any modern, high-quality Text-to-Speech (TTS) system. Most current G2P systems rely on carefully hand-crafted lexicons developed by experts. This poses a two-fold problem. Firstly, the lexicons are generated using a fixed phoneme set, usually ARPABET or IPA, which might not be the optimal way to represent phonemes for all languages. Secondly, the man-hours required to produce such an expert lexicon are very high. In this paper, we eliminate both of these issues by using recent advances in self-supervised learning to obtain data-driven phoneme representations instead of fixed representations. We compare our lexicon-free approach against strong baselines that utilize a well-crafted lexicon. Furthermore, we show that our data-driven lexicon-free method performs as well as or even marginally better than conventional rule-based or lexicon-based neural G2Ps in terms of Mean Opinion Score (MOS) while using no prior language lexicon or phoneme set, i.e. no linguistic expertise.

