AS CL SD | Apr 9, 2023

An investigation of phrase break prediction in an End-to-End TTS system

arXiv:2304.04157v33.32 citations | h-index: 6 | Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses the problem of enhancing speech naturalness and comprehension for users of TTS systems, but it is incremental: it applies existing models to a known bottleneck in TTS.

This work tackles the problem of improving listener comprehension in End-to-End Text-to-Speech (TTS) systems by using external phrase break prediction models. Listening tests showed a clear preference for text synthesized with predicted phrase breaks over text synthesized without them.

Purpose: This work explores the use of external phrase break prediction models to enhance listener comprehension in End-to-End Text-to-Speech (TTS) systems.

Methods: The effectiveness of these models is evaluated based on listener preferences in subjective tests. Two approaches are explored: (1) a bidirectional LSTM model with task-specific embeddings trained from scratch, and (2) a pre-trained BERT model fine-tuned on phrase break prediction. Both models are trained on a multi-speaker English corpus to predict phrase break locations in text. The End-to-End TTS system used comprises a Tacotron2 model with Dynamic Convolutional Attention for mel spectrogram prediction and a WaveRNN vocoder for waveform generation.

Results: The listening tests show a clear preference for text synthesized with predicted phrase breaks over text synthesized without them.

Conclusion: These results confirm the value of incorporating external phrasing models within End-to-End TTS to enhance listener comprehension.
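To illustrate how an external phrasing model might feed an End-to-End TTS system, here is a minimal sketch of the hand-off step only. It assumes the model (BiLSTM or BERT) emits one label per token and that a break is signalled to the synthesizer by appending a comma. The label values, the `insert_phrase_breaks` helper, and the comma convention are hypothetical choices for illustration, not details confirmed by the summary above.

```python
from typing import List, Sequence

# Hypothetical label scheme: 1 means "a phrase break follows this token".
BREAK = 1
NO_BREAK = 0


def insert_phrase_breaks(
    tokens: Sequence[str], labels: Sequence[int], marker: str = ","
) -> str:
    """Return the text with a break marker appended after each BREAK token.

    Assumption: the TTS front end treats `marker` as a pause cue.
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")

    out: List[str] = []
    last = len(tokens) - 1
    for i, (tok, lab) in enumerate(zip(tokens, labels)):
        # Do not mark the final token, and do not double up on
        # punctuation that already ends the token.
        if lab == BREAK and i != last and tok[-1:] not in ",.;:!?":
            tok = tok + marker
        out.append(tok)
    return " ".join(out)


# Labels here are written by hand; in practice they would come from
# the phrase break prediction model.
tokens = "when the rain stopped we walked back to the station".split()
labels = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
text_with_breaks = insert_phrase_breaks(tokens, labels)
```

The resulting string would then be passed to the synthesizer in place of the raw input text, so the TTS model itself needs no architectural change.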
