RobustSpeechFlow: Learning Robust Text-to-Speech Trajectories via Augmentation-based Contrastive Flow Matching

Jinhyeok Yang, Hyeongju Kim, Yechan Yu, Joon Byun, Frederik Bous, Juheon Lee

arXiv:2605.22083

AI Analysis

For TTS researchers and practitioners, RobustSpeechFlow offers a simple, plug-in training strategy that reduces skip and repeat errors without external aligners or preference data.

RobustSpeechFlow improves alignment robustness in flow-matching TTS by extending contrastive flow matching with length-preserving repeat and skip latent augmentations. It reduces word error rate (WER) from 1.44 to 1.38 on Seed-TTS-eval and, at NFE=24 on ZERO500, character error rate (CER) from 0.48% to 0.35% (English) and from 0.81% to 0.57% (Korean).

While flow-matching text-to-speech (TTS) achieves strong zero-shot speaker similarity and naturalness, it remains susceptible to content fidelity issues, particularly skip and repeat errors from imperfect alignment. We propose RobustSpeechFlow, a training strategy that improves alignment robustness by extending contrastive flow matching with length-preserving repeat and skip latent augmentations. Requiring no external aligners or preference data, our method directly penalizes realistic failure modes and readily integrates into existing pipelines. On Seed-TTS-eval, it reduces the word error rate (WER) from 1.44 to 1.38 using only 0.06B parameters. On our ZERO500 benchmark, it delivers consistent intelligibility improvements across diverse speaker and prosody conditions; at NFE=24, it reduces English character error rate (CER) from 0.48% to 0.35% and Korean CER from 0.81% to 0.57%. Audio samples: https://robustspeechflow.github.io/
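
The abstract does not spell out how the augmentations or the loss are built, so the following is only a rough sketch of one way the idea could look, not the authors' implementation. It assumes latents are arrays of shape (frames, channels), that "length-preserving" means truncating after a repeat and padding with the last frame after a skip, and that the augmented latent serves as the negative target in a contrastive flow-matching loss of the form "positive error minus a weighted negative error". The function names, the `lam` weight, and the reuse of the same noise `x0` for the negative are all hypothetical choices made for illustration.

```python
import numpy as np


def repeat_augment(latent, start, span):
    """Simulate a repeat error: play latent[start:start+span] twice,
    then truncate so the sequence keeps its original length (assumed scheme)."""
    n_frames = latent.shape[0]
    if not (0 <= start and span > 0 and start + span <= n_frames):
        raise ValueError("segment must lie inside the sequence")
    segment = latent[start:start + span]
    out = np.concatenate(
        [latent[:start + span], segment, latent[start + span:]], axis=0
    )
    return out[:n_frames]


def skip_augment(latent, start, span):
    """Simulate a skip error: drop latent[start:start+span], then pad the
    tail with copies of the last frame to restore the length (assumed scheme)."""
    n_frames = latent.shape[0]
    if not (0 <= start and span > 0 and start + span <= n_frames):
        raise ValueError("segment must lie inside the sequence")
    if span >= n_frames:
        raise ValueError("cannot skip the whole sequence")
    out = np.concatenate([latent[:start], latent[start + span:]], axis=0)
    pad = np.repeat(out[-1:], n_frames - out.shape[0], axis=0)
    return np.concatenate([out, pad], axis=0)


def contrastive_fm_loss(v_pred, x0, x1, x1_neg, lam=0.05):
    """Hypothetical contrastive flow-matching objective.

    v_pred : predicted velocity at the interpolated point
    x0     : noise sample, x1 : clean latent, x1_neg : augmented latent
    The positive term pulls v_pred toward the clean velocity (x1 - x0);
    the negative term, weighted by the assumed `lam`, pushes it away from
    the velocity that would lead to the corrupted latent.
    """
    positive = np.mean((v_pred - (x1 - x0)) ** 2)
    negative = np.mean((v_pred - (x1_neg - x0)) ** 2)
    return positive - lam * negative
```

In a training loop, `x1_neg` would be produced per example by calling `repeat_augment` or `skip_augment` on the clean latent with a randomly chosen segment, so the negatives resemble the repeat and skip failures the abstract describes and need no external aligner.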
