CL / SD / AS, May 12, 2020

DiscreTalk: Text-to-Speech as a Machine Translation Problem

arXiv:2005.05525v1, 36 citations
Originality: Incremental advance
AI Analysis

The approach targets two known problems in TTS: over-smoothing of predicted features and the need to tune feature-extraction hyperparameters. The advance appears incremental, since it builds on existing NMT and VQ-VAE techniques.

The paper frames text-to-speech synthesis as a machine translation task: a VQ-VAE converts speech waveforms into discrete symbols, and a Transformer-NMT model maps text to those symbols. On the JSUT corpus, the method outperformed a conventional Transformer-TTS model in naturalness and reached performance comparable to VQ-VAE reconstruction.

This paper proposes a new end-to-end text-to-speech (E2E-TTS) model based on neural machine translation (NMT). The proposed model consists of two components: a non-autoregressive vector quantized variational autoencoder (VQ-VAE) model and an autoregressive Transformer-NMT model. The VQ-VAE model learns a mapping function from a speech waveform into a sequence of discrete symbols, and then the Transformer-NMT model is trained to estimate this discrete symbol sequence from a given input text. Since the VQ-VAE model can learn such a mapping in a fully data-driven manner, we do not need to consider the feature-extraction hyperparameters required in conventional E2E-TTS models. Thanks to the use of discrete symbols, we can use various techniques developed in NMT and automatic speech recognition (ASR), such as beam search, subword units, and fusion with a language model. Furthermore, we can avoid the over-smoothing problem of predicted features, which is one of the common issues in TTS. The experimental evaluation with the JSUT corpus shows that the proposed method outperforms the conventional Transformer-TTS model with a non-autoregressive neural vocoder in naturalness, achieving performance comparable to the reconstruction of the VQ-VAE model.
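To make the "speech as discrete symbols" idea concrete, here is a minimal sketch of the vector quantization lookup at the core of a VQ-VAE: each encoder output frame is replaced by the index of its nearest codebook vector. The function names `quantize` and `dequantize`, the toy codebook, and the toy frames are all assumptions for illustration. In the paper the encoder and codebook are learned from waveforms, which this sketch does not attempt.

```python
from typing import List, Sequence

Vector = Sequence[float]


def quantize(frames: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each frame to the index of its nearest codebook vector.

    Distance is squared Euclidean. The resulting index sequence plays the
    role of the discrete symbol sequence that a text-to-symbol NMT model
    would be trained to predict as its target tokens.
    """
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
        for frame in frames
    ]


def dequantize(indices: Sequence[int], codebook: Sequence[Vector]) -> List[List[float]]:
    """Look up the codebook vectors for a symbol sequence (decoder input)."""
    return [list(codebook[i]) for i in indices]


# Toy data (hypothetical): three 2-D codebook entries, four encoder frames.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7], [0.4, 0.4]]

symbols = quantize(frames, codebook)
print(symbols)  # [0, 1, 2, 0]
```

Because the targets are symbol indices and not continuous features, the text-to-symbol model can be trained as a classifier over a finite vocabulary. This is what makes NMT techniques such as beam search and subword units applicable.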
