AS LG · Nov 6, 2023

Transduce and Speak: Neural Transducer for Text-to-Speech with Semantic Token Prediction

Minchan Kim, Myeonghun Jeong, Byoung Jin Choi, Dongjune Lee, Nam Soo Kim

arXiv:2311.02898v25.918 citations · h-index: 9

Originality: Incremental advance

AI Analysis

This work addresses text-to-speech synthesis for applications that require efficient and controllable speech generation. The advance is incremental, since it builds on existing neural transducer and tokenization methods.

The authors propose a neural transducer framework for text-to-speech synthesis that uses discretized semantic tokens for alignment and a non-autoregressive generator for speech synthesis. In zero-shot adaptive TTS experiments, it achieves higher speech quality and speaker similarity than the baselines.

We introduce a text-to-speech (TTS) framework based on a neural transducer. We use discretized semantic tokens acquired from wav2vec 2.0 embeddings, which make it easy to adopt a neural transducer for the TTS framework and to benefit from its monotonic alignment constraints. The proposed model first generates aligned semantic tokens using the neural transducer, then synthesizes a speech sample from the semantic tokens using a non-autoregressive (NAR) speech generator. This decoupled framework alleviates the training complexity of TTS and allows each stage to focus on 1) linguistic and alignment modeling and 2) fine-grained acoustic modeling, respectively. Experimental results on zero-shot adaptive TTS show that the proposed model exceeds the baselines in speech quality and speaker similarity on both objective and subjective measures. We also investigate the inference speed and prosody controllability of our proposed model, showing the potential of the neural transducer for TTS frameworks.
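To make "discretized semantic tokens" concrete, here is a minimal sketch of one common way to turn frame-level embeddings into discrete tokens: nearest-centroid lookup against a fixed codebook. The abstract does not state how the tokens are obtained, so the quantization method, the function name `quantize`, the codebook, and the toy embeddings are all assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def quantize(embeddings: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each frame embedding to the index of its nearest codebook vector.

    embeddings: (frames, dim) array, standing in for wav2vec 2.0 features.
    codebook:   (codes, dim) array of centroids (hypothetical, e.g. from k-means).
    """
    # Squared Euclidean distance between every frame and every centroid.
    diffs = embeddings[:, None, :] - codebook[None, :, :]
    distances = (diffs ** 2).sum(axis=-1)      # shape: (frames, codes)
    return distances.argmin(axis=1)            # one token id per frame

# Toy codebook: 8 centroids in 4 dimensions, centroid k = [k, k, k, k].
codebook = np.arange(8, dtype=float)[:, None] * np.ones((1, 4))

# Toy "embeddings": slightly perturbed copies of centroids 3, 3, 5, 1.
rng = np.random.default_rng(0)
frames = codebook[[3, 3, 5, 1]] + 0.01 * rng.normal(size=(4, 4))

tokens = quantize(frames, codebook)
print(tokens.tolist())  # [3, 3, 5, 1]
```

In the framework described above, a sequence like `tokens` would be the target of the neural transducer in the first stage and the input to the NAR speech generator in the second.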
