SD | CL | AS | Jun 25, 2023

DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech

arXiv:2306.14145v1 | 11 citations | h-index: 16
Originality: Incremental advance
AI Analysis

This work addresses the challenge of generating high-quality, accent-free speech in different languages for TTS applications, representing an incremental improvement in a domain-specific area.

The paper tackles the problem of cross-lingual text-to-speech (CTTS) by proposing a dual speaker embedding framework (DSE-TTS) to improve speaker similarity and nativeness, achieving significant performance gains over the state-of-the-art SANE-TTS.

Although high-fidelity speech can be obtained for intralingual speech synthesis, cross-lingual text-to-speech (CTTS) is still far from satisfactory, as it is difficult to accurately retain the speaker's timbre (i.e., speaker similarity) and eliminate the accent from the speaker's first language (i.e., nativeness). In this paper, we demonstrate that vector-quantized (VQ) acoustic features contain less speaker information than mel-spectrograms. Based on this finding, we propose a novel dual speaker embedding TTS (DSE-TTS) framework for CTTS with authentic speaking style. Here, one embedding is fed to the acoustic model to learn the linguistic speaking style, while the other is integrated into the vocoder to mimic the target speaker's timbre. Experiments show that by combining both embeddings, DSE-TTS significantly outperforms the state-of-the-art SANE-TTS in cross-lingual synthesis, especially in terms of nativeness.
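
The routing described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function and class names (`toy_acoustic_model`, `toy_vocoder`, `SpeakerEmbeddings`), the arithmetic inside each stage, and the assumption that VQ indices are the intermediate representation passed between the two stages are all hypothetical stand-ins chosen only to show that one embedding conditions the acoustic model and a separate one conditions the vocoder.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SpeakerEmbeddings:
    """Two separate speaker embeddings (hypothetical container)."""
    style: List[float]   # conditions the acoustic model (speaking style)
    timbre: List[float]  # conditions the vocoder (target speaker's timbre)


def toy_acoustic_model(phonemes: List[str], style_emb: List[float],
                       codebook_size: int = 8) -> List[int]:
    """Stand-in for an acoustic model: phonemes + style embedding -> VQ indices.

    The arithmetic is arbitrary; it only makes the output depend on style_emb.
    """
    bias = round(sum(style_emb) * 10)
    return [(ord(p[0]) + bias) % codebook_size for p in phonemes]


def toy_vocoder(vq_indices: List[int], timbre_emb: List[float]) -> List[float]:
    """Stand-in for a vocoder: VQ indices + timbre embedding -> 'waveform' samples."""
    gain = sum(timbre_emb) / len(timbre_emb)
    return [gain * idx for idx in vq_indices]


def synthesize(phonemes: List[str],
               emb: SpeakerEmbeddings) -> Tuple[List[int], List[float]]:
    """Run both stages, routing each embedding to its own stage."""
    vq = toy_acoustic_model(phonemes, emb.style)   # style embedding only
    wav = toy_vocoder(vq, emb.timbre)              # timbre embedding only
    return vq, wav


if __name__ == "__main__":
    phonemes = ["n", "i", "h", "ao"]
    emb = SpeakerEmbeddings(style=[0.1, 0.2], timbre=[1.0, 3.0])
    vq, wav = synthesize(phonemes, emb)
    print("VQ indices:", vq)
    print("samples:", wav)
```

In this sketch, swapping the timbre embedding leaves the VQ indices untouched and changes only the vocoder output, while swapping the style embedding changes the VQ indices themselves. That separation is the point being illustrated; a real system would use trained neural networks for both stages.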

