SD CL AS

Jun 29, 2025

You Sound a Little Tense: L2 Tailored Clear TTS Using Durational Vowel Properties

Paige Tuttösí, H. Henny Yeung, Yue Wang, Jean-Julien Aucouturier, Angelica Lim

arXiv:2506.23367v2

4.0

h-index: 29

13th edition of the Speech Synthesis Workshop

Originality: Incremental advance

AI Analysis

This work addresses the challenge of making speech technology more accessible and effective for L2 learners. It is an incremental advance: it builds on existing TTS methods and adds a specific linguistic adaptation.

The paper tackles the problem of improving the intelligibility of synthesized speech for second language (L2) listeners. It develops a TTS system that adjusts vowel durations according to the tense-lax distinction, which yielded at least 9.15% fewer transcription errors for French-L1, English-L2 listeners.
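The duration adjustment described above can be illustrated with a minimal sketch. This is not the authors' implementation or the Matcha-TTS API: the function name `apply_clarity_mode`, the two-vowel-pair phoneme sets, and the scale factors are all assumptions chosen for illustration. It only shows the general idea of lengthening tense vowels and shortening lax vowels in a per-phoneme duration sequence.

```python
# Hypothetical sketch: scale per-phoneme durations by vowel class.
# The vowel sets below are an illustrative subset in ARPAbet notation,
# not the inventory used in the paper.
TENSE_VOWELS = {"IY", "UW"}  # e.g. "sheep", "pool"
LAX_VOWELS = {"IH", "UH"}    # e.g. "ship", "pull"


def apply_clarity_mode(phonemes, durations, tense_scale=1.25, lax_scale=0.75):
    """Return durations with tense vowels lengthened and lax vowels shortened.

    The scale factors are placeholder values, not the paper's settings.
    """
    if len(phonemes) != len(durations):
        raise ValueError("phonemes and durations must have the same length")
    adjusted = []
    for phoneme, duration in zip(phonemes, durations):
        base = phoneme.rstrip("012")  # drop ARPAbet stress digits
        if base in TENSE_VOWELS:
            duration = duration * tense_scale
        elif base in LAX_VOWELS:
            duration = duration * lax_scale
        adjusted.append(duration)
    return adjusted


# Durations are in arbitrary frame units.
print(apply_clarity_mode(["SH", "IY1", "P"], [8, 10, 6]))  # [8, 12.5, 6]
print(apply_clarity_mode(["SH", "IH1", "P"], [8, 10, 6]))  # [8, 7.5, 6]
```

In this sketch, consonants and unlisted vowels pass through unchanged, so only the contrast between the two vowel classes is exaggerated rather than the overall speaking rate.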

We present the first text-to-speech (TTS) system tailored to second language (L2) speakers. We use duration differences between American English tense (longer) and lax (shorter) vowels to create a "clarity mode" for Matcha-TTS. Our perception studies showed that French-L1, English-L2 listeners had fewer (at least 9.15%) transcription errors when using our clarity mode, and found it more encouraging and respectful than overall slowed down speech. Remarkably, listeners were not aware of these effects: despite the decreased word error rate in clarity mode, listeners still believed that slowing all target words was the most intelligible, suggesting that actual intelligibility does not correlate with perceived intelligibility. Additionally, we found that Whisper-ASR did not use the same cues as L2 speakers to differentiate difficult vowels and is not sufficient to assess the intelligibility of TTS systems for these individuals.
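For readers unfamiliar with the metric, word error rate is commonly computed as the word-level edit distance between a reference sentence and a listener's (or ASR system's) transcription, divided by the number of reference words. The sketch below is a generic implementation of that definition, not the paper's scoring script; the function name `word_error_rate` and the example sentences are invented for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref_words)


# Made-up example: one tense/lax confusion ("sheep" heard as "ship").
wer = word_error_rate("the sheep is on the ship", "the ship is on the ship")
print(wer)  # 1 substitution over 6 reference words
```

A single tense/lax vowel confusion counts as one substitution, which is why a transcription task built around such minimal pairs is sensitive to the vowel cues the abstract discusses.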
