A Vocoder-free WaveNet Voice Conversion with Non-Parallel Data
This work addresses speech quality degradation in voice conversion, which matters for applications such as speech synthesis. The contribution is incremental, since it builds on existing WaveNet and PPG techniques.
The paper tackles the speech quality degradation that vocoders introduce in voice conversion systems. It proposes a vocoder-free approach in which WaveNet maps Phonetic PosteriorGrams directly to waveform samples, trained on non-parallel data. On the CMU-ARCTIC database, the approach achieves significantly better speech quality than the baseline methods.
In a typical voice conversion system, a vocoder is commonly used for speech-to-features analysis and features-to-speech synthesis. However, the vocoder can be a source of speech quality degradation. This paper presents a vocoder-free voice conversion approach using WaveNet for non-parallel training data. Instead of dealing with intermediate features, the proposed approach uses WaveNet to map the Phonetic PosteriorGrams (PPGs) directly to the waveform samples. In this way, we avoid the estimation errors caused by the vocoder and by feature conversion. Additionally, because PPGs are assumed to be speaker independent, the proposed method also reduces the feature mismatch problem found in WaveNet-vocoder-based approaches. Experiments conducted on the CMU-ARCTIC database show that the proposed approach significantly outperforms the baseline approaches in terms of speech quality.
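To make the PPG-to-waveform mapping more concrete, the following minimal sketch shows how training pairs for such a model could be prepared: frame-level PPGs are repeated so that there is one conditioning vector per waveform sample, and the waveform is quantized into discrete classes with mu-law companding, as in the original WaveNet formulation. The abstract does not give these details, so the 256-level mu-law quantization, the hop length of 80 samples, the 16 kHz sampling rate, the toy PPG dimensionality, and the helper names `mu_law_encode`, `mu_law_decode`, and `upsample_ppg` are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Map waveform samples in [-1, 1] to integer classes in [0, mu]."""
    x = np.clip(x, -1.0, 1.0)
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.floor((compressed + 1.0) / 2.0 * mu + 0.5).astype(np.int64)

def mu_law_decode(q, mu=255):
    """Approximate inverse of mu_law_encode (exact up to quantization error)."""
    compressed = 2.0 * q.astype(np.float64) / mu - 1.0
    return np.sign(compressed) * np.expm1(np.abs(compressed) * np.log1p(mu)) / mu

def upsample_ppg(ppg, hop_length):
    """Repeat each frame-level PPG vector hop_length times (one per sample)."""
    return np.repeat(ppg, hop_length, axis=0)

# Toy data. All sizes below are assumed values, not taken from the paper.
rng = np.random.default_rng(0)
num_frames, num_classes, hop_length, sample_rate = 5, 4, 80, 16000

# A PPG is a per-frame posterior distribution over phonetic classes.
logits = rng.normal(size=(num_frames, num_classes))
ppg = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Sample-rate conditioning input and quantized waveform targets.
cond = upsample_ppg(ppg, hop_length)
t = np.arange(num_frames * hop_length) / sample_rate
wave = 0.5 * np.sin(2.0 * np.pi * 220.0 * t)
targets = mu_law_encode(wave)

print(cond.shape, targets.shape)
```

In this sketch, `cond` would serve as the local conditioning input and `targets` as the per-sample classification targets of an autoregressive waveform model. A real system would obtain the PPGs from a speaker-independent speech recognizer and might use a learned upsampling layer instead of simple repetition.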