AS CL LG Jul 31, 2023

Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech

Guangyan Zhang, Thomas Merritt, Manuel Sam Ribeiro, Biel Tura-Vecino, Kayoko Yanagisawa, Kamil Pokora, Abdelhamid Ezzerg, Sebastian Cygert, Ammar Abbas, Piotr Bilinski, Roberto Barra-Chicote, Daniel Korzekwa

arXiv:2307.16679v13.34 citations h-index: 21

Originality Synthesis-oriented

AI Analysis

This work addresses speech synthesis quality in text-to-speech systems. It is an incremental contribution: it evaluates existing methods (L1/L2 losses, normalizing flows, and diffusion models) on the specific tasks of prosody and mel-spectrogram prediction.

The paper compares traditional L1/L2 losses with normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech. Flow-based models achieved the best performance for spectrogram prediction, and both flow and diffusion models significantly improved prosody prediction over L2-based models.

Neural text-to-speech systems are often optimized on L1/L2 losses, which make strong assumptions about the distributions of the target data space. Aiming to improve those assumptions, Normalizing Flows and Diffusion Probabilistic Models were recently proposed as alternatives. In this paper, we compare traditional L1/L2-based approaches to diffusion and flow-based approaches for the tasks of prosody and mel-spectrogram prediction for text-to-speech synthesis. We use a prosody model to generate log-f0 and duration features, which are used to condition an acoustic model that generates mel-spectrograms. Experimental results demonstrate that the flow-based model achieves the best performance for spectrogram prediction, improving over equivalent diffusion and L1 models. Meanwhile, both diffusion and flow-based prosody predictors result in significant improvements over a typical L2-trained prosody model.
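
The "strong assumptions" behind L1/L2 losses can be illustrated with a toy sketch. It is not taken from the paper: the target values, the grid search, and the function names below are all hypothetical. Minimizing an L2 loss is equivalent to maximum likelihood under a fixed-variance Gaussian, so the best constant prediction is the mean; minimizing an L1 loss corresponds to a fixed-scale Laplace distribution, so the best constant prediction is the median. When the targets are bimodal, as they might be for prosody features such as log-f0, the L2 optimum falls between the modes, in a region where no target actually lies.

```python
import statistics

# Hypothetical toy targets: log-f0 values for one context, drawn from
# two distinct speaking styles (illustrative numbers, not paper data).
targets = [4.6, 4.7, 4.8, 5.4, 5.5, 5.6, 5.7]


def l2_loss(pred, ys):
    """Mean squared error of a single constant prediction."""
    return sum((y - pred) ** 2 for y in ys) / len(ys)


def l1_loss(pred, ys):
    """Mean absolute error of a single constant prediction."""
    return sum(abs(y - pred) for y in ys) / len(ys)


# Brute-force search over candidate predictions in [4.0, 6.0], step 0.01.
grid = [4.0 + 0.01 * i for i in range(201)]
best_l2 = min(grid, key=lambda p: l2_loss(p, targets))
best_l1 = min(grid, key=lambda p: l1_loss(p, targets))

# The L2 optimum tracks the mean, the L1 optimum tracks the median.
print(f"L2 optimum {best_l2:.2f} vs mean   {statistics.mean(targets):.2f}")
print(f"L1 optimum {best_l1:.2f} vs median {statistics.median(targets):.2f}")
```

Flows and diffusion models instead learn the target distribution itself, so in principle they can sample from either mode rather than committing to a single point estimate.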
