AS CL LG SD

Jun 11, 2020

FastPitch: Parallel Text-to-speech with Pitch Prediction

arXiv:2006.06873v2

411 citations

Originality: Incremental advance

AI Analysis

This work addresses the need for faster and more expressive speech synthesis in applications such as virtual assistants and audiobooks. It is an incremental improvement over existing models such as FastSpeech.

The paper addresses expressive, high-quality text-to-speech by introducing FastPitch, a fully-parallel model that predicts pitch contours. It reaches a real-time factor of over 900x for mel-spectrogram synthesis, with quality comparable to state-of-the-art methods.
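As a rough illustration of what a real-time factor of this size means, the sketch below computes the ratio of audio duration to synthesis time. It assumes the factor is defined as seconds of audio produced per second of compute. The function name and the example timings are hypothetical and are not taken from the paper.

```python
def real_time_factor(audio_seconds: float, synthesis_seconds: float) -> float:
    """Return seconds of audio generated per second of compute.

    Assumption: a higher value means faster synthesis, so 900 would mean
    900 seconds of audio are produced in one second of processing.
    """
    if synthesis_seconds <= 0:
        raise ValueError("synthesis_seconds must be positive")
    return audio_seconds / synthesis_seconds


# Hypothetical timings: a 9-second utterance synthesized in 10 milliseconds.
rtf = real_time_factor(audio_seconds=9.0, synthesis_seconds=0.01)
print(f"real-time factor: about {rtf:.0f}x")
```

Note that this figure covers mel-spectrogram synthesis only, as the summary states.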

We present FastPitch, a fully-parallel text-to-speech model based on FastSpeech, conditioned on fundamental frequency contours. The model predicts pitch contours during inference. By altering these predictions, the generated speech can be more expressive, better match the semantics of the utterance, and ultimately be more engaging to the listener. Uniformly increasing or decreasing pitch with FastPitch generates speech that resembles the voluntary modulation of voice. Conditioning on frequency contours improves the overall quality of synthesized speech, making it comparable to the state of the art. This conditioning does not introduce overhead, and FastPitch retains the favorable, fully-parallel Transformer architecture, with a real-time factor of over 900x for mel-spectrogram synthesis of a typical utterance.
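To illustrate the uniform pitch change the abstract describes, here is a minimal sketch that scales a pitch contour by a number of semitones. It is not FastPitch's code: the function name, the contour format (one F0 value in Hz per frame, with 0.0 marking unvoiced frames), and the semitone-based scaling are all assumptions made for illustration.

```python
from typing import List


def shift_pitch(contour_hz: List[float], semitones: float) -> List[float]:
    """Uniformly raise or lower a pitch contour by `semitones`.

    Assumptions (hypothetical, not from the paper):
    - contour_hz holds one fundamental-frequency value per frame, in Hz;
    - a value of 0.0 marks an unvoiced frame and is left unchanged.
    """
    # One semitone corresponds to a frequency ratio of 2 ** (1 / 12).
    ratio = 2.0 ** (semitones / 12.0)
    return [f0 * ratio if f0 > 0.0 else 0.0 for f0 in contour_hz]


# Hypothetical predicted contour: voiced, unvoiced, voiced.
predicted = [220.0, 0.0, 440.0]
raised = shift_pitch(predicted, semitones=12)    # one octave up
lowered = shift_pitch(predicted, semitones=-12)  # one octave down
print(raised, lowered)
```

In a model of this kind, the modified contour would replace the predicted one before mel-spectrogram generation; how FastPitch represents and normalizes pitch internally is not specified in this summary.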
