SD AS · Mar 26, 2021

Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling

arXiv:2103.14574v7 · 83 citations
Originality: Highly original
AI Analysis

This work addresses text-to-speech synthesis for applications that need efficient and controllable speech generation, though its contribution is incremental, since it builds on prior non-autoregressive models.

The paper tackles the problem of generating speech from text without requiring supervised duration signals, by introducing Parallel Tacotron 2, a non-autoregressive neural TTS model with a differentiable duration model; it shows improved subjective naturalness over baselines in multi-speaker evaluations.

This paper introduces Parallel Tacotron 2, a non-autoregressive neural text-to-speech model with a fully differentiable duration model that does not require supervised duration signals. The duration model is based on a novel attention mechanism and an iterative reconstruction loss based on Soft Dynamic Time Warping. With these, the model can learn token-frame alignments as well as token durations automatically. Experimental results show that Parallel Tacotron 2 outperforms baselines in subjective naturalness in several diverse multi-speaker evaluations. Its duration control capability is also demonstrated.
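The abstract names Soft Dynamic Time Warping (Soft-DTW) but does not spell out the loss. As a rough illustration of the general Soft-DTW idea, and not the paper's exact formulation, the sketch below computes a Soft-DTW cost between a predicted and a target frame sequence with NumPy. The function names `soft_min` and `soft_dtw`, the L1 frame distance, and the default `gamma` are assumptions chosen for this example.

```python
import numpy as np


def soft_min(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    v = -np.asarray(values, dtype=float) / gamma
    m = v.max()
    return -gamma * (m + np.log(np.exp(v - m).sum()))


def soft_dtw(pred, target, gamma=0.1):
    """Soft-DTW cost between two frame sequences of shape (frames, features).

    Hypothetical helper for illustration; the frame distance (L1) is an
    assumption, not taken from the source.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    n, m = len(pred), len(target)

    # Pairwise L1 distance between every predicted and every target frame.
    dist = np.abs(pred[:, None, :] - target[None, :, :]).sum(axis=-1)

    # r[i, j] holds the soft-minimal cost of aligning the first i predicted
    # frames with the first j target frames.
    r = np.full((n + 1, m + 1), np.inf)
    r[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            r[i, j] = dist[i - 1, j - 1] + soft_min(
                [r[i - 1, j - 1], r[i - 1, j], r[i, j - 1]], gamma
            )
    return r[n, m]


if __name__ == "__main__":
    pred = np.array([[0.0], [1.0], [2.0]])
    # Same contour as pred, but every frame lasts twice as long.
    stretched = np.array([[0.0], [0.0], [1.0], [1.0], [2.0], [2.0]])
    print(soft_dtw(pred, stretched, gamma=0.01))
```

Because the hard minimum is replaced by a smooth one, the cost is differentiable with respect to the predicted frames, which is what allows alignments to be learned without supervised durations. Note that the cost between a sequence and a time-stretched copy of itself is close to zero (and can be slightly negative), since the soft minimum never exceeds the true minimum.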

Code Implementations: 3 repos
