SD · LG · AS · Mar 21, 2022

AutoTTS: End-to-End Text-to-Speech Synthesis through Differentiable Duration Modeling

arXiv:2203.11049v2 · 4 citations · h-index: 18
AI Analysis

The work helps TTS researchers and practitioners by simplifying the training process, though the contribution is incremental because it builds on existing parallel TTS methods.

The paper tackles the reliance of parallel text-to-speech synthesis on external alignment models. It proposes a differentiable duration method for learning monotonic alignments, yielding a model that achieves competitive results with a simpler training pipeline.

Parallel text-to-speech (TTS) models have recently enabled fast and highly-natural speech synthesis. However, they typically require external alignment models, which are not necessarily optimized for the decoder as they are not jointly trained. In this paper, we propose a differentiable duration method for learning monotonic alignments between input and output sequences. Our method is based on a soft-duration mechanism that optimizes a stochastic process in expectation. Using this differentiable duration method, we introduce AutoTTS, a direct text-to-waveform speech synthesis model. AutoTTS enables high-fidelity speech synthesis through a combination of adversarial training and matching the total ground-truth duration. Experimental results show that our model obtains competitive results while enjoying a much simpler training pipeline. Audio samples are available online.
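To make the idea of a soft duration that is optimized in expectation more concrete, here is a minimal NumPy sketch. It is not the paper's implementation. It assumes each token has a per-frame stop probability, turns that into a distribution over durations, takes the expected duration, and builds a monotonic soft alignment from sigmoid windows. The names `stop_prob`, `expected_durations`, `soft_alignment`, and `total_duration_loss`, the sigmoid-window alignment, and the `temperature` parameter are all illustrative assumptions.

```python
import numpy as np


def length_distribution(stop_prob):
    """P(duration of token i == m + 1) from per-step stop probabilities.

    stop_prob[i, m] is the (assumed) probability that token i stops at
    step m + 1, given that it has not stopped earlier. The last column
    should be 1.0 so that each row is a proper distribution.
    """
    survive = np.cumprod(1.0 - stop_prob, axis=1)
    # Probability of reaching step m without having stopped before it.
    reach = np.concatenate(
        [np.ones((stop_prob.shape[0], 1)), survive[:, :-1]], axis=1
    )
    return stop_prob * reach


def expected_durations(stop_prob):
    """Expected number of frames per token under the stop process."""
    frames = np.arange(1, stop_prob.shape[1] + 1)
    return length_distribution(stop_prob) @ frames


def soft_alignment(durations, num_frames, temperature=0.5):
    """Soft, monotonic frame-to-token weights of shape (num_frames, num_tokens).

    Each token owns the interval [start, end) given by cumulative durations;
    a difference of two sigmoids gives a smooth window over that interval,
    so the weights stay differentiable with respect to the durations.
    """
    ends = np.cumsum(durations)
    starts = ends - durations
    centers = np.arange(num_frames)[:, None] + 0.5  # frame midpoints

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    return sigmoid((centers - starts) / temperature) - sigmoid(
        (centers - ends) / temperature
    )


def total_duration_loss(durations, num_frames):
    """Penalty for mismatch between predicted and ground-truth total length."""
    return abs(float(np.sum(durations)) - num_frames)


# Toy example: two tokens, at most two frames each, four target frames.
stop_prob = np.array([[0.5, 1.0], [0.0, 1.0]])
durations = expected_durations(stop_prob)
weights = soft_alignment(durations, num_frames=4)
loss = total_duration_loss(durations, num_frames=4)
```

In this sketch only the total length is supervised, which mirrors the abstract's description of matching the total ground-truth duration without per-token alignment targets. In a real model these operations would be written in an autodiff framework so that gradients flow back into the duration predictor.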

