AS AI CL | Jun 22, 2024

TacoLM: GaTed Attention Equipped Codec Language Model are Efficient Zero-Shot Text to Speech Synthesizers

Yakun Song, Zhuo Chen, Xiaofei Wang, Ziyang Ma, Guanrou Yang, Xie Chen

arXiv:2406.15752v1 | 5.97 citations | Has Code

Originality: Incremental advance

AI Analysis

This work addresses efficiency and quality issues in zero-shot TTS synthesis, offering a more practical solution for applications requiring real-time or resource-constrained speech generation.

The paper tackles slow inference and instability in zero-shot text-to-speech synthesis by proposing TacoLM, which uses 90% fewer parameters and runs 5.2 times faster than VALL-E while improving word error rate, speaker similarity, and mean opinion score on the LibriSpeech corpus.

Neural codec language models (LMs) have demonstrated strong capability in zero-shot text-to-speech (TTS) synthesis. However, codec LMs often suffer from limited inference speed and stability, due to their auto-regressive nature and the implicit alignment between text and audio. In this work, to address these challenges, we introduce a new variant of neural codec LM, namely TacoLM. Specifically, TacoLM introduces a gated attention mechanism to improve training and inference efficiency and reduce model size. Meanwhile, an additional gated cross-attention layer is included in each decoder layer, which improves the efficiency and content accuracy of the synthesized speech. In evaluation on the LibriSpeech corpus, the proposed TacoLM achieves better word error rate, speaker similarity, and mean opinion score than VALL-E, with 90% fewer parameters and a 5.2 times speed-up. Demo and code are available at https://ereboas.github.io/TacoLM/.
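The abstract does not spell out how the gating is formulated, so the following is only a generic illustration of the idea of gated attention in an auto-regressive decoder, not TacoLM's actual layer. It assumes a single head, a sigmoid gate computed from the input, and a causal mask; the function name `gated_causal_attention` and the weight names `w_q`, `w_k`, `w_v`, `w_g` are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gated_causal_attention(x, w_q, w_k, w_v, w_g):
    """Generic sketch (assumption, not the paper's layer):
    out[t] = sigmoid(x[t] @ w_g) * softmax(q[t] . k[0..t] / sqrt(d)) @ v[0..t]
    """
    q, k, v, g = (matmul(x, w) for w in (w_q, w_k, w_v, w_g))
    d = len(q[0])
    out = []
    for t in range(len(x)):
        # Causal mask: position t attends only to positions 0..t.
        scores = [
            sum(qi * ki for qi, ki in zip(q[t], k[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        context = [
            sum(w * v[s][j] for s, w in enumerate(weights))
            for j in range(len(v[0]))
        ]
        # Element-wise gate decides how much of the attended context passes through.
        out.append([sigmoid(g[t][j]) * context[j] for j in range(len(context))])
    return out


identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0]]
result = gated_causal_attention(tokens, identity, identity, identity, identity)
```

With identity weights, the first position can only attend to itself, so its output is simply its own value vector scaled element-wise by the gate. In a cross-attention variant, the keys and values would come from the text sequence rather than from `x`, and no causal mask would be needed.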
