CL AS Jan 30, 2021

Triple M: A Practical Text-to-speech Synthesis System With Multi-guidance Attention And Multi-band Multi-time LPCNet

Shilun Lin, Fenglong Xie, Li Meng, Xinhui Li, Li Lu

arXiv:2102.00247v4 0.71 citations

Originality: Incremental advance

AI Analysis

This work addresses efficiency and naturalness in TTS for online services, but appears incremental as it builds on existing attention and vocoder frameworks.

The authors address robust and efficient text-to-speech synthesis for large-scale online applications with Triple M, which combines multi-guidance attention, reducing the word error rate by 26.8%, with a multi-band multi-time vocoder that speeds up LPCNet by 2.75x.

In this work, a robust and efficient text-to-speech (TTS) synthesis system named Triple M is proposed for large-scale online applications. The key components of Triple M are: 1) a sequence-to-sequence model that adopts a novel multi-guidance attention to transfer complementary advantages from guiding attention mechanisms to the basic attention mechanism, without in-domain performance loss or online service modification. Compared with a single attention mechanism, multi-guidance attention not only brings better naturalness to long-sentence synthesis, but also reduces the word error rate by 26.8%. 2) A new, efficient multi-band multi-time vocoder framework that reduces the computational complexity from 2.8 to 1.0 GFLOP and speeds up LPCNet by 2.75x on a single CPU.
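To put the abstract's figures in context, the short sketch below works through the arithmetic. It is illustrative only: the helper names are hypothetical, the baseline word error rate is an invented placeholder (the abstract does not state one), and it assumes the 26.8% figure is a relative reduction, which the abstract does not make explicit.

```python
def complexity_ratio(before_gflops: float, after_gflops: float) -> float:
    """Ratio of computational cost before vs. after an optimization."""
    return before_gflops / after_gflops


def apply_relative_reduction(baseline: float, reduction_pct: float) -> float:
    """Value remaining after a relative reduction given in percent."""
    return baseline * (1.0 - reduction_pct / 100.0)


# Figures quoted in the abstract: 2.8 GFLOP reduced to 1.0 GFLOP.
ratio = complexity_ratio(2.8, 1.0)

# Hypothetical baseline WER (percent), chosen only for illustration.
# ASSUMPTION: the 26.8% reduction is relative to the baseline.
hypothetical_baseline_wer = 5.0
reduced_wer = apply_relative_reduction(hypothetical_baseline_wer, 26.8)

print(f"Complexity ratio: {ratio:.2f}x")
print(f"WER after reduction: {reduced_wer:.2f}%")
```

The complexity ratio of 2.8 is close to, but not the same as, the reported 2.75x speedup: the former is a count of operations, while the latter is a runtime measured on a single CPU.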
