Controllable and Interpretable Singing Voice Decomposition via Assem-VC
This work addresses controllable and interpretable singing voice synthesis for music production and entertainment applications. The contribution appears incremental, since it builds on existing voice conversion methods.
The paper tackles singing voice decomposition by encoding linguistic content, pitch, and speaker identity with Assem-VC, which enables synthesis of a target speaker's singing voice from the decomposed components. The result is a perfectly synced duet between a user's singing voice and the converted target singer's voice.
We propose a singing decomposition system that encodes time-aligned linguistic content, pitch, and source speaker identity via Assem-VC. With the decomposed speaker-independent information and the target speaker's embedding, we synthesize the singing voice of the target speaker. As a result, we made a perfectly synced duet with the user's singing voice and the target singer's converted singing voice.
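The decompose-and-recombine idea can be illustrated with a minimal Python sketch. This is not Assem-VC's API: the `Decomposed` container, the `swap_speaker` and `is_frame_synced` helpers, and all numeric values are hypothetical, and the learned encoders and decoder of the real system are replaced here by plain tuples. The sketch only shows why keeping the time-aligned content and pitch while swapping the speaker embedding would leave the converted voice frame-synchronous with the original, which is the property a synced duet relies on.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Decomposed:
    """Hypothetical container for the three decomposed factors."""
    content: Tuple[int, ...]    # per-frame linguistic unit ids (time-aligned)
    pitch: Tuple[float, ...]    # per-frame F0 in Hz, 0.0 for unvoiced frames
    speaker: Tuple[float, ...]  # utterance-level speaker embedding


def swap_speaker(source: Decomposed, target_embedding: Tuple[float, ...]) -> Decomposed:
    """Keep the speaker-independent factors and replace only the speaker embedding."""
    if len(source.content) != len(source.pitch):
        raise ValueError("content and pitch must be frame-aligned")
    return replace(source, speaker=tuple(target_embedding))


def is_frame_synced(a: Decomposed, b: Decomposed) -> bool:
    """Two renditions can be overlaid as a duet if their frame-level timing matches."""
    return a.content == b.content and len(a.pitch) == len(b.pitch)


# Illustrative values only: six frames of a user's singing.
user = Decomposed(
    content=(3, 3, 7, 7, 7, 0),
    pitch=(220.0, 220.0, 246.9, 246.9, 246.9, 0.0),
    speaker=(0.1, -0.4, 0.9),
)

# Assumed target-singer embedding; a real one would come from a speaker encoder.
converted = swap_speaker(user, (0.8, 0.2, -0.3))
```

In this sketch, `is_frame_synced(user, converted)` holds by construction, because conversion never touches the timing-bearing factors. A real system would additionally need a decoder and vocoder to turn the recombined factors back into audio.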