AS · CL · SD · ML · Mar 29, 2019

Training a Neural Speech Waveform Model using Spectral Losses of Short-Time Fourier Transform and Continuous Wavelet Transform

arXiv:1903.12392v2 · 1 citation
AI Analysis

This work addresses speech synthesis quality for audio processing applications, but it is incremental as it builds on existing STFT-based loss frameworks.

The authors address the training of neural speech waveform models by proposing spectral loss functions based on the short-time Fourier transform (STFT) and the continuous wavelet transform (CWT). The two provide complementary information because the CWT can use time-frequency resolutions different from those of the STFT. Experimental results showed that a model trained with the CWT-based loss reaches a quality comparable to that of one trained with the STFT-based loss.

Recently, we proposed short-time Fourier transform (STFT)-based loss functions for training a neural speech waveform model. In this paper, we generalize that framework and propose a training scheme for such models based on spectral amplitude and phase losses obtained by STFT, continuous wavelet transform (CWT), or both. Since CWT can have time and frequency resolutions different from those of STFT and can use resolutions closer to human auditory scales, the proposed loss functions could provide complementary information on speech signals. Experimental results showed that it is possible to train a high-quality model by using the proposed CWT spectral loss, and that this model is as good as one trained with the STFT-based loss.
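
To make the idea of amplitude and phase losses on STFT and CWT spectra concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the loss definitions, so the squared amplitude error and the 1 - cos phase error used here are assumed forms. The Morlet wavelet, the frame length, hop, scales, the equal weighting of the four terms, and the function names `stft`, `morlet_cwt`, and `spectral_losses` are likewise hypothetical choices.

```python
import cmath
import math
import random


def stft(x, frame_len=64, hop=16):
    """Hann-windowed STFT via a direct DFT; returns one list of complex bins per frame."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    spec = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = [x[start + n] * window[n] for n in range(frame_len)]
        spec.append([
            sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len))
            for k in range(frame_len // 2 + 1)
        ])
    return spec


def morlet_cwt(x, scales, w0=6.0):
    """Complex Morlet CWT by direct correlation; returns one row of coefficients per scale."""
    rows = []
    for s in scales:
        half = int(4 * s)  # truncate the Gaussian envelope at +/- 4 scale units
        offsets = range(-half, half + 1)
        # Conjugated, scaled wavelet samples (normalisation constants simplified).
        kernel = [
            cmath.exp(-1j * w0 * m / s) * math.exp(-0.5 * (m / s) ** 2) / math.sqrt(s)
            for m in offsets
        ]
        row = []
        for t in range(len(x)):
            acc = 0j
            for m, k in zip(offsets, kernel):
                if 0 <= t + m < len(x):  # treat samples outside the signal as zero
                    acc += x[t + m] * k
            row.append(acc)
        rows.append(row)
    return rows


def spectral_losses(ref_spec, gen_spec):
    """Assumed loss forms: mean squared amplitude error and mean (1 - cos) phase error."""
    amp_sum = phase_sum = 0.0
    count = 0
    for ref_row, gen_row in zip(ref_spec, gen_spec):
        for r, g in zip(ref_row, gen_row):
            amp_sum += (abs(r) - abs(g)) ** 2
            phase_sum += 1.0 - math.cos(cmath.phase(g) - cmath.phase(r))
            count += 1
    return amp_sum / count, phase_sum / count


# Toy data: a sinusoid as the "natural" waveform and a noisy copy as the "generated" one.
random.seed(0)
ref = [math.sin(2 * math.pi * 0.05 * t) for t in range(512)]
gen = [v + random.gauss(0.0, 0.05) for v in ref]
scales = [2.0, 4.0, 8.0]  # geometrically spaced, unlike the STFT's linear bins

stft_amp, stft_phase = spectral_losses(stft(ref), stft(gen))
cwt_amp, cwt_phase = spectral_losses(morlet_cwt(ref, scales), morlet_cwt(gen, scales))
total_loss = stft_amp + stft_phase + cwt_amp + cwt_phase  # equal weights, an assumption
print(stft_amp, stft_phase, cwt_amp, cwt_phase, total_loss)
```

Because `spectral_losses` only sees complex coefficients, the same function serves the STFT-only, CWT-only, and combined settings the abstract mentions. In actual training the transforms would need to be written in a differentiable framework so that gradients reach the waveform model; this sketch only evaluates the loss values.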
