ASCL | Sep 1, 2025

MixedG2P-T5: G2P-free Speech Synthesis for Mixed-script texts using Speech Self-Supervised Learning and Language Model

arXiv:2509.01391v1 | h-index: 30 | APSIPA
Originality: Incremental advance
AI Analysis

The paper addresses the cost and scalability issues of speech synthesis for mixed-script texts, though the advance is incremental because it builds on existing T5 and SSL methods.

The study removes the grapheme-to-phoneme (G2P) conversion step from speech synthesis. A T5 encoder is trained to predict discrete pseudo-language labels, derived from speech with a pre-trained self-supervised model, directly from mixed-script text. This eliminates manual phonetic transcription while matching the performance of conventional G2P-based systems.

This study presents a novel approach to voice synthesis that can substitute the traditional grapheme-to-phoneme (G2P) conversion by using a deep learning-based model that generates discrete tokens directly from speech. Utilizing a pre-trained voice SSL model, we train a T5 encoder to produce pseudo-language labels from mixed-script texts (e.g., containing Kanji and Kana). This method eliminates the need for manual phonetic transcription, reducing costs and enhancing scalability, especially for large non-transcribed audio datasets. Our model matches the performance of conventional G2P-based text-to-speech systems and is capable of synthesizing speech that retains natural linguistic and paralinguistic features, such as accents and intonations.
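The abstract does not say how the discrete tokens are obtained from the SSL model. A common choice with speech SSL features is nearest-centroid (k-means style) quantization, and the sketch below assumes that scheme purely for illustration. The function names, the toy 2-D "features", the centroids, and the optional collapsing of repeated labels are all hypothetical and not taken from the paper. It shows how frame-level features could become a pseudo-label sequence that a text encoder such as T5 is then trained to predict.

```python
from typing import List, Sequence


def nearest_centroid(frame: Sequence[float],
                     centroids: Sequence[Sequence[float]]) -> int:
    """Return the index of the centroid closest to `frame`
    (squared Euclidean distance)."""
    best_index, best_dist = 0, float("inf")
    for index, centroid in enumerate(centroids):
        dist = sum((f - c) ** 2 for f, c in zip(frame, centroid))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def frames_to_pseudo_labels(frames: Sequence[Sequence[float]],
                            centroids: Sequence[Sequence[float]],
                            collapse_repeats: bool = True) -> List[int]:
    """Quantize each frame to a discrete token id.

    Assumption: consecutive duplicate ids are optionally merged,
    which is a common preprocessing step but is not stated in the source.
    """
    labels = [nearest_centroid(frame, centroids) for frame in frames]
    if not collapse_repeats:
        return labels
    collapsed: List[int] = []
    for label in labels:
        if not collapsed or collapsed[-1] != label:
            collapsed.append(label)
    return collapsed


# Toy stand-ins for SSL frame features and a learned codebook (hypothetical).
toy_centroids = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
toy_frames = [[0.1, 0.0], [0.0, 0.1], [0.9, 1.1], [1.0, 0.8], [0.1, 0.9]]

pseudo_labels = frames_to_pseudo_labels(toy_frames, toy_centroids)
print(pseudo_labels)  # [0, 1, 2]
```

In a real pipeline the frames would come from a pre-trained speech SSL model and the centroids from clustering its outputs over a large audio corpus. The resulting label sequences would serve as training targets for the text encoder, which is what removes the need for hand-written phonetic transcriptions.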
