Deep Denoising Auto-encoder for Statistical Speech Synthesis
This work targets synthetic speech quality in applications such as text-to-speech systems, though the contribution appears incremental because it builds on existing auto-encoder methods.
The paper tackles the problem of extracting better acoustic features for speech synthesis by proposing a deep denoising auto-encoder technique, which increases the quality of synthetic speech in analysis-by-synthesis and text-to-speech experiments.
This paper proposes a deep denoising auto-encoder technique to extract better acoustic features for speech synthesis. The technique allows us to automatically extract low-dimensional features from high-dimensional spectral features in a non-linear, data-driven, unsupervised way. We compared the new stochastic feature extractor with conventional mel-cepstral analysis in analysis-by-synthesis and text-to-speech experiments. Our results confirm that the proposed method increases the quality of synthetic speech in both experiments.
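To make the idea of a denoising auto-encoder concrete, the following is a minimal sketch of a single layer, not the authors' implementation. The masking-noise corruption, sigmoid encoder, linear decoder with tied weights, squared-error loss, layer sizes, learning rate, and the synthetic `toy_frames` data are all assumptions chosen for illustration. A deep variant would stack several such layers, and real inputs would be spectral frames extracted from speech.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def corrupt(x, drop_prob, rng):
    """Masking noise (an assumed corruption): zero each dimension with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else v for v in x]


class DenoisingAutoencoder:
    """One layer with tied weights: sigmoid encoder, linear decoder."""

    def __init__(self, n_in, n_hidden, rng):
        self.n_in, self.n_hidden = n_in, n_hidden
        self.W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                  for _ in range(n_hidden)]
        self.b = [0.0] * n_hidden  # encoder bias
        self.c = [0.0] * n_in      # decoder bias

    def encode(self, x):
        # Low-dimensional feature vector h = sigmoid(W x + b).
        return [sigmoid(sum(w * v for w, v in zip(row, x)) + bj)
                for row, bj in zip(self.W, self.b)]

    def decode(self, h):
        # Reconstruction x_hat = W^T h + c.
        return [sum(self.W[j][i] * h[j] for j in range(self.n_hidden)) + self.c[i]
                for i in range(self.n_in)]

    def train_step(self, x, lr, drop_prob, rng):
        # Encode a corrupted copy, but measure the error against the clean input.
        x_tilde = corrupt(x, drop_prob, rng)
        h = self.encode(x_tilde)
        x_hat = self.decode(h)
        err = [xh - xi for xh, xi in zip(x_hat, x)]  # gradient of 0.5 * ||x_hat - x||^2
        # Backpropagate through the decoder and the sigmoid.
        dz = [sum(self.W[j][i] * err[i] for i in range(self.n_in)) * h[j] * (1.0 - h[j])
              for j in range(self.n_hidden)]
        for j in range(self.n_hidden):
            for i in range(self.n_in):
                # Tied weights receive both the decoder and the encoder gradient.
                self.W[j][i] -= lr * (err[i] * h[j] + dz[j] * x_tilde[i])
            self.b[j] -= lr * dz[j]
        for i in range(self.n_in):
            self.c[i] -= lr * err[i]


def clean_error(model, data):
    """Mean squared-error loss when reconstructing uncorrupted inputs."""
    total = 0.0
    for x in data:
        x_hat = model.decode(model.encode(x))
        total += 0.5 * sum((a - b) ** 2 for a, b in zip(x_hat, x))
    return total / len(data)


def toy_frames(n_frames, n_in, rng):
    """Hypothetical stand-in for spectral frames: smooth curves with two latent factors."""
    frames = []
    for _ in range(n_frames):
        a, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
        frames.append([0.5
                       + 0.3 * a * math.sin(math.pi * i / (n_in - 1))
                       + 0.2 * b * math.cos(math.pi * i / (n_in - 1))
                       for i in range(n_in)])
    return frames


def demo(seed=0):
    rng = random.Random(seed)
    data = toy_frames(200, 8, rng)
    model = DenoisingAutoencoder(n_in=8, n_hidden=3, rng=rng)
    before = clean_error(model, data)
    for _ in range(30):  # plain stochastic gradient descent over the toy set
        for x in data:
            model.train_step(x, lr=0.05, drop_prob=0.2, rng=rng)
    after = clean_error(model, data)
    return model, before, after


if __name__ == "__main__":
    _, before, after = demo()
    print("clean reconstruction error before/after training:", before, after)
```

In this sketch, `encode` plays the role of the feature extractor: it maps an 8-dimensional frame to 3 values, and `decode` maps those values back to the spectral domain, which is the round trip an analysis-by-synthesis comparison would exercise.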