CL, SD, AS | Aug 16, 2023

Mitigating the Exposure Bias in Sentence-Level Grapheme-to-Phoneme (G2P) Transduction

arXiv:2308.08442v1 | 1 citation | h-index: 44
Originality: Incremental advance
AI Analysis

This work addresses a usability issue in real-world text-to-speech applications by enhancing sentence-level G2P, but it is incremental as it builds on existing ByT5 models.

The paper tackles the exposure bias problem in sentence-level grapheme-to-phoneme transduction using ByT5, proposing a loss-based sampling method that improves performance, though specific numerical gains are not detailed.

The Text-to-Text Transfer Transformer (T5) has recently been considered for grapheme-to-phoneme (G2P) transduction. As a follow-up, ByT5, a tokenizer-free byte-level model based on T5, recently gave promising results on word-level G2P conversion by representing each input character with its corresponding UTF-8 encoding. Although it is generally understood that sentence-level or paragraph-level G2P can improve usability in real-world applications, since it is better suited to handling heteronyms and linking sounds between words, we find that using ByT5 for these scenarios is nontrivial. Because ByT5 operates on the character level, it requires more decoding steps, and performance deteriorates due to the exposure bias commonly observed in auto-regressive generation models. This paper shows that the performance of sentence-level and paragraph-level G2P can be improved by mitigating such exposure bias using our proposed loss-based sampling method.
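To make the sequence-length issue concrete, the following minimal sketch mimics byte-level encoding in the style of ByT5: each UTF-8 byte becomes one token. The helper names (`byte_encode`, `byte_decode`), the example sentence, and the IPA transcription are illustrative assumptions, not taken from the paper. The offset of 3 reflects the ByT5 vocabulary layout, where ids 0 to 2 are reserved for the pad, end-of-sequence, and unknown tokens.

```python
# Minimal sketch of ByT5-style byte-level encoding (illustrative, not the paper's code).
# Assumption: ids 0, 1, 2 are reserved for pad, eos, unk, so byte b maps to id b + 3.
SPECIAL_OFFSET = 3
EOS_ID = 1


def byte_encode(text: str) -> list[int]:
    """Map each UTF-8 byte of `text` to one token id and append EOS."""
    return [b + SPECIAL_OFFSET for b in text.encode("utf-8")] + [EOS_ID]


def byte_decode(ids: list[int]) -> str:
    """Invert byte_encode, skipping the reserved special ids."""
    data = bytes(i - SPECIAL_OFFSET for i in ids if i >= SPECIAL_OFFSET)
    return data.decode("utf-8", errors="ignore")


if __name__ == "__main__":
    word = "read"
    sentence = "I read the book you read yesterday."
    phonemes = "ɹɛd"  # hypothetical IPA output; each of ɹ and ɛ takes two UTF-8 bytes

    for label, text in [("word", word), ("sentence", sentence), ("phonemes", phonemes)]:
        ids = byte_encode(text)
        print(f"{label}: {len(text)} characters -> {len(ids)} tokens (incl. EOS)")
```

Because the decoder emits one byte per step, a sentence-level target needs many more auto-regressive steps than a word-level one, and IPA symbols outside ASCII cost more than one step each. Longer generation gives early prediction errors more room to compound, which is the exposure-bias concern the abstract describes.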

