CL, SD, AS | May 12, 2025

On the Cost and Benefits of Training Context with Utterance or Full Conversation Training: A Comparative Study

arXiv:2505.07202v1 | Has Code
Originality: Incremental advance
AI Analysis

This paper provides practical guidelines for conversational TTS developers, favoring utterance-level training for better efficiency and quality. Its contribution is incremental, since it compares existing training techniques rather than proposing a new one.

This paper addresses the problem of improving conversational text-to-speech (TTS) systems by comparing context-based utterance-level training with full conversation training. It finds that utterance-level training achieves higher MOS scores (4.3/5.0 vs. 3.7/5.0) and reduces training time by 37%.

Modern TTS systems designed for conversations achieve high-quality utterances but often remain inaccessible publicly. Are existing open-source architectures inadequate, or are current training techniques insufficient? This paper investigates prominent models and their underlying behaviors regarding conversational context. Using 20 GPU-hours on an NVIDIA H100, we empirically examine two approaches: context-based utterance-level training versus full conversation training. Results demonstrate that context-based utterance training achieves superior MOS scores (4.3/5.0 vs 3.7/5.0) and reduces training time by 37%, while full conversation approaches suffer from speaker similarity hallucination issues. These findings provide practical guidelines for conversational TTS development, favoring utterance-level training with contextual conditioning for both resource efficiency and output quality.
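To make the distinction between the two training setups concrete, here is a minimal sketch of how training examples might be assembled from one conversation under each approach. It is an illustration under stated assumptions, not the paper's implementation: the `Turn` type, the function names, and the `context_turns` window size are all hypothetical, and real systems would also carry audio or acoustic tokens alongside the text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Turn:
    # Hypothetical container for one conversational turn (text only).
    speaker: str
    text: str


def utterance_level_examples(
    conversation: List[Turn], context_turns: int = 2
) -> List[Tuple[List[Turn], Turn]]:
    """Build one example per utterance: (preceding context, target utterance).

    The context window size is an assumption made for this sketch.
    """
    examples = []
    for i, target in enumerate(conversation):
        # Keep at most `context_turns` turns that precede the target.
        context = conversation[max(0, i - context_turns):i]
        examples.append((context, target))
    return examples


def full_conversation_example(conversation: List[Turn]) -> List[Turn]:
    """Build one example per conversation: every turn in a single sequence."""
    return list(conversation)


conversation = [
    Turn("A", "Hi, how was the trip?"),
    Turn("B", "Long, but worth it."),
    Turn("A", "Glad to hear that."),
]

pairs = utterance_level_examples(conversation, context_turns=2)
whole = full_conversation_example(conversation)
```

In this sketch the utterance-level setup yields three short examples, each with a single target utterance, while the full conversation setup yields one long example in which the model must produce every speaker's turns.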

