AS · CL · SD · ML · Apr 3, 2018

Speech waveform synthesis from MFCC sequences with generative adversarial networks

arXiv:1804.00920v1 · 55 citations
Originality: Incremental advance
AI Analysis

This work addresses a practical gap in speech synthesis: it enables waveform generation from MFCC features, which are widely used in applications such as ASR. The contribution is incremental, since it builds on existing techniques.

The paper tackles speech synthesis from MFCC sequences, which are not typically used for synthesis. It reports high-quality speech reconstruction with a three-stage method: predicting pitch and voicing from the MFCCs, converting the spectral envelope information to all-pole filters, and adding a GAN-generated noise component to the modeled excitation.

This paper proposes a method for generating speech from filterbank mel frequency cepstral coefficients (MFCC), which are widely used in speech applications, such as ASR, but are generally considered unusable for speech synthesis. First, we predict fundamental frequency and voicing information from MFCCs with an autoregressive recurrent neural net. Second, the spectral envelope information contained in MFCCs is converted to all-pole filters, and a pitch-synchronous excitation model matched to these filters is trained. Finally, we introduce a generative adversarial network-based noise model to add a realistic high-frequency stochastic component to the modeled excitation signal. The results show that high-quality speech reconstruction can be obtained, given only MFCC information at test time.
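The second step of the abstract, converting the spectral envelope carried by MFCCs into an all-pole filter, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the MFCCs are an orthonormal DCT-II of natural-log mel filterbank energies, replaces a proper filterbank inversion with linear interpolation between mel band centers, and fits the filter with the standard autocorrelation (Levinson-Durbin) method. All function names and default parameters (`n_bands`, `n_fft`, `sample_rate`, `order`) are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def dct_matrix(n_coeffs, n_bands):
    # Orthonormal DCT-II basis, shape (n_coeffs, n_bands).
    k = np.arange(n_coeffs)[:, None]
    n = np.arange(n_bands)[None, :]
    basis = np.sqrt(2.0 / n_bands) * np.cos(np.pi * k * (2 * n + 1) / (2 * n_bands))
    basis[0] /= np.sqrt(2.0)
    return basis

def mfcc_to_power_spectrum(mfcc, n_bands=40, n_fft=512, sample_rate=16000):
    # Assumption: mfcc = DCT-II(log mel energies), possibly truncated.
    # The transpose of the orthonormal basis inverts the DCT; truncated
    # coefficients are treated as zero, which smooths the envelope.
    log_mel = dct_matrix(len(mfcc), n_bands).T @ np.asarray(mfcc, dtype=float)
    # Band centers equally spaced on the mel scale (assumed filterbank layout).
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_bands + 2)
    centers_hz = mel_to_hz(mel_points[1:-1])
    fft_freqs = np.linspace(0.0, sample_rate / 2, n_fft // 2 + 1)
    # Simplification: interpolate the log envelope onto the linear FFT grid.
    return np.exp(np.interp(fft_freqs, centers_hz, log_mel))

def levinson_durbin(r, order):
    # Solve the autocorrelation normal equations for A(z) = 1 + a1 z^-1 + ...
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err  # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

def mfcc_to_allpole(mfcc, order=24, **kwargs):
    power = mfcc_to_power_spectrum(mfcc, **kwargs)
    # The autocorrelation is the inverse Fourier transform of the power spectrum.
    r = np.fft.irfft(power)
    return levinson_durbin(r, order)
```

Calling `mfcc_to_allpole(frame_mfcc)` on one frame returns the denominator coefficients of an all-pole filter together with the prediction error power. In the pipeline the abstract describes, a filter of this kind would shape the excitation signal; the excitation model and the GAN noise model are outside the scope of this sketch.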

Code Implementations: 1 repo
