AI · Nov 12, 2024

Improving Grapheme-to-Phoneme Conversion through In-Context Knowledge Retrieval with Large Language Models

Dongrui Han, Mingyu Cui, Jiawen Kang, Xixin Wu, Xunying Liu, Helen Meng

arXiv:2411.07563v1 · 3 citations · h-index: 15 · ISCSLP

Originality: Incremental advance

AI Analysis

This work addresses disambiguation challenges in text-to-speech systems for improved speech synthesis, representing an incremental advancement by applying existing LLM capabilities to a specific domain.

The paper addresses grapheme-to-phoneme conversion ambiguities in text-to-speech systems by proposing a contextual approach that uses large language models with in-context knowledge retrieval. It reports a 2.0% absolute reduction in phoneme error rate over the baseline, and an increase of up to 3.5% absolute when GPT-4 is used.

Grapheme-to-phoneme (G2P) conversion is a crucial step in Text-to-Speech (TTS) systems, responsible for mapping graphemes to their corresponding phonetic representations. However, it suffers from ambiguity: the same grapheme can represent different phonemes depending on context. Inspired by the remarkable success of Large Language Models (LLMs) in handling context-aware scenarios, contextual G2P conversion systems that use LLMs' in-context knowledge retrieval (ICKR) capabilities are proposed to improve disambiguation. The efficacy of incorporating ICKR into G2P conversion systems is demonstrated on the Librig2p dataset. In particular, the best contextual G2P conversion system using ICKR outperforms the baseline with a weighted average phoneme error rate (PER) reduction of 2.0% absolute (28.9% relative). Using GPT-4 in the ICKR system yields an increase of 3.5% absolute (3.8% relative) on the Librig2p dataset.
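
To make the reported metric concrete, the following is a minimal sketch of how a phoneme error rate and its absolute and relative reductions are commonly computed. It assumes PER is the Levenshtein edit distance between hypothesis and reference phoneme sequences divided by the total reference length; the paper's exact weighting scheme is not described here, so the function names and the "read" example are illustrative assumptions rather than the authors' evaluation code.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs: List[Sequence[str]],
                       hyps: List[Sequence[str]]) -> float:
    """Corpus-level PER in percent (assumed definition, not the paper's)."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    ref_len = sum(len(r) for r in refs)
    return 100.0 * errors / ref_len


def relative_reduction(baseline: float, system: float) -> float:
    """Relative error reduction in percent."""
    return 100.0 * (baseline - system) / baseline


# Hypothetical homograph: past-tense "read" is /R EH D/, but a
# context-blind system may output the present-tense /R IY D/.
refs = [["R", "EH", "D"]]
hyps = [["R", "IY", "D"]]
per = phoneme_error_rate(refs, hyps)  # one substitution over three phonemes

# A 2.0% absolute reduction that is 28.9% relative implies a baseline
# PER of roughly 2.0 / 0.289 (derived arithmetic, not a reported figure).
implied_baseline = 2.0 / 0.289
implied_system = implied_baseline - 2.0
```

Under these assumptions, the abstract's figures imply a baseline weighted average PER of about 6.9% and a best-system PER of about 4.9%; these are derived values, not numbers stated in the source.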
