AS · AI · LG · SD · SP · Feb 8, 2023

A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech

arXiv:2302.04215v1 · 51 citations · h-index: 83
Originality: Incremental advance
AI Analysis

This work addresses the challenge of handling speech diversity for AI communication systems, though it is incremental as it builds on existing TTS methods with a novel architectural tweak.

The paper tackles the synthesis of diverse, real-world spontaneous speech by training TTS systems on YouTube and podcast data. It introduces MQTTS, which outperforms existing TTS systems on several objective and subjective measures; its multiple discrete code groups resolve the mismatch between training and inference alignments seen in mel-spectrogram based autoregressive models.

Recent Text-to-Speech (TTS) systems trained on reading or acted corpora have achieved near human-level naturalness. The diversity of human speech, however, often goes beyond the coverage of these corpora. We believe the ability to handle such diversity is crucial for AI systems to achieve human-level communication. Our work explores the use of more abundant real-world data for building speech synthesizers. We train TTS systems using real-world speech from YouTube and podcasts. We observe the mismatch between training and inference alignments in mel-spectrogram based autoregressive models, leading to unintelligible synthesis, and demonstrate that learned discrete codes within multiple code groups effectively resolve this issue. We introduce our MQTTS system whose architecture is designed for multiple code generation and monotonic alignment, along with the use of a clean silence prompt to improve synthesis quality. We conduct ablation analyses to identify the efficacy of our methods. We show that MQTTS outperforms existing TTS systems in several objective and subjective measures.
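
To make the phrase "discrete codes within multiple code groups" more concrete, the following is a minimal sketch of group-wise vector quantization. It is not the authors' implementation: it assumes each frame embedding is split into equal chunks and each chunk is mapped to the nearest entry of its own codebook, and it uses small fixed codebooks where a real system would learn them during training. The function names (`nearest_code`, `quantize_frame`, `dequantize_frame`) and all dimensions are hypothetical, chosen only for illustration.

```python
import random


def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2 distance)."""
    best_idx, best_dist = 0, float("inf")
    for idx, entry in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(vec, entry))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def quantize_frame(frame, codebooks):
    """Split one frame into len(codebooks) equal chunks and quantize each
    chunk with its own codebook, yielding one discrete code per group."""
    n_groups = len(codebooks)
    if len(frame) % n_groups != 0:
        raise ValueError("frame length must be divisible by the number of groups")
    chunk = len(frame) // n_groups
    return [
        nearest_code(frame[g * chunk:(g + 1) * chunk], codebooks[g])
        for g in range(n_groups)
    ]


def dequantize_frame(codes, codebooks):
    """Rebuild an approximate frame by concatenating the selected entries."""
    frame = []
    for group, code in enumerate(codes):
        frame.extend(codebooks[group][code])
    return frame


if __name__ == "__main__":
    # Hypothetical sizes: 4 groups, 8 entries per codebook, 2 dims per chunk.
    rng = random.Random(0)
    codebooks = [
        [[rng.uniform(-1.0, 1.0) for _ in range(2)] for _ in range(8)]
        for _ in range(4)
    ]
    frame = [rng.uniform(-1.0, 1.0) for _ in range(8)]
    codes = quantize_frame(frame, codebooks)
    print("codes per group:", codes)
    print("reconstruction:", dequantize_frame(codes, codebooks))
```

With G groups of K entries each, a frame is represented by G small indices while the number of distinct code combinations is K to the power G, which is one plausible reason to prefer several small codebooks over a single very large one. How the codebooks are learned and how the codes are generated autoregressively are outside the scope of this sketch.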

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
