FastVC: Fast Voice Conversion with non-parallel data
This work addresses resource-efficient voice conversion for applications that require fast processing; the contribution is incremental, as it builds on existing AutoEncoder methods.
The paper tackles voice conversion with non-parallel data by introducing FastVC, an end-to-end conditional AutoEncoder model that converts speech between multiple speakers and achieves higher naturalness than the VC Challenge 2020 baselines on the cross-lingual task.
This paper introduces FastVC, an end-to-end model for fast Voice Conversion (VC). The proposed model can convert speech of arbitrary length from multiple source speakers to multiple target speakers. FastVC is based on a conditional AutoEncoder (AE) trained on non-parallel data and requires no annotations. The model's latent representation is shown to be speaker-independent and similar to phonemes, which is a desirable feature for VC systems. While current VC systems primarily focus on achieving the highest overall speech quality, this paper aims to balance speech quality against the resources needed to run the system. Despite its simple structure, the proposed model outperforms the VC Challenge 2020 baselines on the cross-lingual task in terms of naturalness.
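The conditional AutoEncoder idea described above can be illustrated with a minimal sketch. It is not the FastVC architecture: the class `ConditionalAutoEncoder`, the untrained random linear maps, the one-hot speaker code, and all sizes are assumptions chosen purely for illustration. The sketch shows only the general mechanism: the encoder sees the acoustic frames alone, the decoder is conditioned on a speaker label, and conversion amounts to swapping that label at decoding time. Because frames are processed independently here, utterances of any length are accepted.

```python
import random


def make_matrix(rows, cols, rng):
    """Random weight matrix as a list of rows (untrained, illustration only)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ConditionalAutoEncoder:
    """Hypothetical frame-wise conditional AE; NOT the FastVC architecture."""

    def __init__(self, n_features, n_latent, n_speakers, seed=0):
        rng = random.Random(seed)
        self.n_speakers = n_speakers
        # The encoder sees only the acoustic frame, never the speaker identity.
        self.enc = make_matrix(n_latent, n_features, rng)
        # The decoder sees the latent code concatenated with a one-hot speaker code.
        self.dec = make_matrix(n_features, n_latent + n_speakers, rng)

    def encode(self, frames):
        """Map each frame to a latent vector; works for any number of frames."""
        return [matvec(self.enc, frame) for frame in frames]

    def decode(self, latents, speaker_id):
        """Rebuild frames from latents, conditioned on the given speaker."""
        one_hot = [1.0 if i == speaker_id else 0.0 for i in range(self.n_speakers)]
        return [matvec(self.dec, z + one_hot) for z in latents]

    def reconstruct(self, frames, speaker_id):
        """Training-time path: decode with the utterance's own speaker label.

        No parallel utterances or transcriptions are needed for this objective.
        """
        return self.decode(self.encode(frames), speaker_id)

    def convert(self, frames, target_speaker_id):
        """Conversion path: same encoder output, decoded with the target label."""
        return self.decode(self.encode(frames), target_speaker_id)


if __name__ == "__main__":
    rng = random.Random(1)
    model = ConditionalAutoEncoder(n_features=4, n_latent=2, n_speakers=3)
    # A toy "utterance" of 7 frames with 4 features each (hypothetical values).
    utterance = [[rng.random() for _ in range(4)] for _ in range(7)]
    converted = model.convert(utterance, target_speaker_id=2)
    print(len(converted), "frames,", len(converted[0]), "features per frame")
```

In a trained system of this kind, the reconstruction objective pushes speaker information into the conditioning label, which is what allows the latent code to become speaker-independent; the sketch above uses random weights and therefore demonstrates only the data flow, not that property.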