SD · LG · AS · ML · Apr 9, 2019

CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion

arXiv:1904.04631v1 · 289 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of improving voice conversion quality for applications like speech synthesis, though it is incremental over prior CycleGAN-based methods.

The paper aims to narrow the gap between real and converted speech in non-parallel voice conversion. It proposes CycleGAN-VC2, which adds three improved techniques to CycleGAN-VC: two-step adversarial losses, a 2-1-2D CNN generator, and a PatchGAN discriminator. In the subjective evaluation, CycleGAN-VC2 outperformed CycleGAN-VC in naturalness and similarity for every speaker pair.

Non-parallel voice conversion (VC) is a technique for learning the mapping from source to target speech without relying on parallel data. This is an important task, but it has been challenging due to the disadvantages of the training conditions. Recently, CycleGAN-VC has provided a breakthrough and performed comparably to a parallel VC method without relying on any extra data, modules, or time alignment procedures. However, there is still a large gap between the real target and converted speech, and bridging this gap remains a challenge. To reduce this gap, we propose CycleGAN-VC2, which is an improved version of CycleGAN-VC incorporating three new techniques: an improved objective (two-step adversarial losses), improved generator (2-1-2D CNN), and improved discriminator (PatchGAN). We evaluated our method on a non-parallel VC task and analyzed the effect of each technique in detail. An objective evaluation showed that these techniques help bring the converted feature sequence closer to the target in terms of both global and local structures, which we assess by using Mel-cepstral distortion and modulation spectra distance, respectively. A subjective evaluation showed that CycleGAN-VC2 outperforms CycleGAN-VC in terms of naturalness and similarity for every speaker pair, including intra-gender and inter-gender pairs.
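The abstract names Mel-cepstral distortion (MCD) as its measure of global structure but does not give the formula. As a rough illustration, the sketch below uses the commonly cited definition of MCD in dB, averaged over frames. It is not taken from the paper: the function name, the list-of-lists input format, and the assumptions that the two sequences are already time-aligned and that the 0th (energy) coefficient has been dropped are all choices made here for illustration.

```python
import math


def mel_cepstral_distortion(target, converted):
    """Mean MCD in dB between two mel-cepstral sequences.

    Assumptions (not from the paper): both inputs are lists of frames,
    each frame is a list of coefficients with the 0th (energy) term
    already removed, and the sequences are already time-aligned.
    """
    if not target or len(target) != len(converted):
        raise ValueError("sequences must be non-empty and equally long")

    # Constant from the commonly used definition: (10 / ln 10) * sqrt(2).
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)

    total = 0.0
    for t_frame, c_frame in zip(target, converted):
        if len(t_frame) != len(c_frame):
            raise ValueError("frames must have the same dimensionality")
        # Squared Euclidean distance between the two frames.
        squared = sum((t - c) ** 2 for t, c in zip(t_frame, c_frame))
        total += scale * math.sqrt(squared)

    # Average the per-frame distortion over the whole utterance.
    return total / len(target)
```

Identical sequences give a distortion of zero, and lower values mean the converted features sit closer to the target. In practice the two utterances rarely have the same length, so an alignment step (for example dynamic time warping) would be needed before calling a function like this.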

Code Implementations: 6 repos
