AS LG, Aug 30, 2024

Text-to-Speech for Unseen Speakers via Low-Complexity Discrete Unit-Based Frame Selection

Ismail Rasim Ulgen, Shreeram Suresh Chandra, Junchen Lu, Berrak Sisman

arXiv:2408.17432v32.31 citations, h-index: 25

Originality: Incremental advance

AI Analysis

This work addresses high model complexity in TTS, which limits reproducibility and accessibility for researchers and practitioners in resource-limited settings. It offers an incremental improvement through a simpler alternative.

The paper tackles the challenge of synthesizing voices for unseen speakers in multi-speaker text-to-speech by proposing SelectTTS, a low-complexity method that selects frames from target speaker speech and uses SSL features, achieving performance comparable to state-of-the-art systems with over 8x fewer parameters and 270x less training data.

Synthesizing the voices of unseen speakers remains a persistent challenge in multi-speaker text-to-speech (TTS). Existing methods model speaker characteristics through speaker conditioning during training, leading to increased model complexity and limiting reproducibility and accessibility. A low-complexity alternative would broaden the reach of speech synthesis research, particularly in settings with limited computational and data resources. To this end, we propose SelectTTS, a simple and effective alternative. SelectTTS selects appropriate frames from the target speaker and decodes them using frame-level self-supervised learning (SSL) features. We demonstrate that this approach can effectively capture speaker characteristics for unseen speakers and achieves performance comparable to state-of-the-art multi-speaker TTS frameworks on both objective and subjective metrics. By directly selecting frames from the target speaker's speech, SelectTTS enables generalization to unseen speakers with significantly lower model complexity. Experimental results show that the proposed approach achieves performance comparable to state-of-the-art systems such as XTTS-v2 and VALL-E, while requiring over 8x fewer parameters and 270x less training data. Moreover, it demonstrates that frame selection with SSL features offers an efficient path to low-complexity, high-quality multi-speaker TTS.
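
The frame-selection idea described above can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that an upstream model has already mapped the text to a sequence of discrete unit indices, that each reference frame of the target speaker carries a unit label (here assumed to come from k-means over SSL features), and that a separate vocoder would decode the selected features to audio. The function name `select_frames` and all parameter names are hypothetical.

```python
import numpy as np

def select_frames(predicted_units, ref_units, ref_features, centroids):
    """Pick one target-speaker feature vector per predicted discrete unit.

    predicted_units: (T,) ints, units predicted from text (assumed given)
    ref_units:       (N,) ints, unit label of each reference frame
    ref_features:    (N, D) floats, frame-level SSL features of the reference
    centroids:       (K, D) floats, unit centroids (assumed from k-means)
    """
    predicted_units = np.asarray(predicted_units)
    ref_units = np.asarray(ref_units)
    ref_features = np.asarray(ref_features, dtype=float)
    centroids = np.asarray(centroids, dtype=float)

    selected = np.empty((len(predicted_units), ref_features.shape[1]))
    for t, unit in enumerate(predicted_units):
        candidates = np.flatnonzero(ref_units == unit)
        if candidates.size > 0:
            # Average every reference frame that shares the predicted unit.
            selected[t] = ref_features[candidates].mean(axis=0)
        else:
            # Unit never occurs in the reference: fall back to the
            # reference frame closest to that unit's centroid.
            dists = np.linalg.norm(ref_features - centroids[unit], axis=1)
            selected[t] = ref_features[np.argmin(dists)]
    return selected
```

Because every output vector is built from the target speaker's own frames, speaker identity is carried by the reference speech rather than by a learned speaker embedding, which is the intuition behind the low parameter count reported in the abstract. Averaging matching frames is only one simple selection rule chosen for this sketch.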
