Stepback: Enhanced Disentanglement for Voice Conversion via Multi-Task Learning
This work addresses voice conversion for applications that require modifying speaker identity without parallel data, and proposes a new method for this known bottleneck.
The paper tackles voice conversion with non-parallel data by proposing the Stepback network, which enhances disentanglement and content preservation, resulting in significantly improved performance and reduced training costs.
Voice conversion (VC) modifies voice characteristics while preserving linguistic content. This paper presents the Stepback network, a novel model for converting speaker identity using non-parallel data. Unlike traditional VC methods that rely on parallel data, our approach uses deep learning techniques to achieve more complete disentanglement and better preservation of linguistic content. The Stepback network takes a dual flow of inputs from different data domains and applies constraints with self-destructive amendments to optimize the content encoder. Extensive experiments show that our model significantly improves VC performance and reduces training costs while achieving high-quality voice conversion. The Stepback network's design offers a promising solution for advanced voice conversion tasks.
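To make the idea of disentanglement concrete, the following is a minimal toy sketch of disentanglement-based conversion in general. It is not the Stepback architecture, and it does not model the dual data flow or the self-destructive amendments. It assumes that each utterance is a list of feature frames, that speaker identity can be approximated by the per-dimension utterance mean, and that content is whatever remains after removing that mean. All function names (`content_encoder`, `speaker_encoder`, `decoder`, `convert`, `content_consistency_loss`) are hypothetical and chosen only for illustration.

```python
from statistics import fmean


def speaker_encoder(frames):
    """Toy speaker code: the per-dimension mean over all frames (assumption)."""
    dims = len(frames[0])
    return [fmean(f[d] for f in frames) for d in range(dims)]


def content_encoder(frames):
    """Toy content code: frames with the utterance mean removed (assumption)."""
    speaker = speaker_encoder(frames)
    return [[f[d] - speaker[d] for d in range(len(speaker))] for f in frames]


def decoder(content, speaker):
    """Recombine a content code with a speaker code."""
    return [[c[d] + speaker[d] for d in range(len(speaker))] for c in content]


def convert(source_frames, target_frames):
    """Keep the source content, swap in the target speaker code."""
    return decoder(content_encoder(source_frames), speaker_encoder(target_frames))


def content_consistency_loss(source_frames, converted_frames):
    """Mean squared difference between the content codes before and after conversion."""
    a = content_encoder(source_frames)
    b = content_encoder(converted_frames)
    n = sum(len(f) for f in a)
    return sum((x - y) ** 2 for fa, fb in zip(a, b) for x, y in zip(fa, fb)) / n


# Two unrelated (non-parallel) utterances from different speakers.
source = [[1.0, 2.0], [3.0, 6.0]]
target = [[10.0, 20.0], [14.0, 20.0]]

converted = convert(source, target)
loss = content_consistency_loss(source, converted)
```

In this toy setting the converted utterance carries the target's speaker code while its content code matches the source's, so the consistency loss is zero. A learned content encoder offers no such guarantee, which is why constraints on the content encoder matter in practice.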