A cappella: Audio-visual Singing Voice Separation
This work addresses singing voice separation for music video processing and presents a novel multimodal approach and dataset; it is incremental, however, in that it builds on existing audio-visual methods.
The paper tackles the problem of isolating a target singing voice in music videos by proposing an audio-visual convolutional network based on graphs, which achieves state-of-the-art results in singing voice separation and outperforms the baselines in challenging setups involving overlapping voices and lower volume levels.
The task of isolating a target singing voice in music videos has useful applications. In this work, we explore the single-channel singing voice separation problem from a multimodal perspective, by jointly learning from audio and visual modalities. To do so, we present Acappella, a dataset spanning around 46 hours of a cappella solo singing videos sourced from YouTube. We also propose an audio-visual convolutional network based on graphs, which achieves state-of-the-art singing voice separation results on our dataset, and we compare it against its audio-only counterpart, U-Net, and a state-of-the-art audio-visual speech separation model. We evaluate the models in the following challenging setups: i) presence of overlapping voices in the audio mixtures, ii) the target voice set to lower volume levels in the mix, and iii) a combination of i) and ii), which is the most challenging evaluation setup. We demonstrate that our model outperforms the baseline models in the singing voice separation task in the most challenging evaluation setup. The code, the pre-trained models, and the dataset are publicly available at https://ipcv.github.io/Acappella/
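The three evaluation setups can be illustrated with a minimal mixing sketch. This is not the authors' data pipeline: the function names (`db_to_gain`, `make_mixture`), the `background` signal, and the use of a decibel gain to lower the target voice are assumptions made only for illustration, with signals represented as plain lists of samples.

```python
import math


def db_to_gain(db):
    """Convert a level change in decibels to a linear amplitude factor."""
    return 10.0 ** (db / 20.0)


def make_mixture(target, background, extra_voice=None, target_gain_db=0.0):
    """Build a single-channel mixture from equal-length lists of samples.

    Hypothetical helper, not taken from the paper's code.
    extra_voice:    optional second singing voice (setup i, overlapping voices).
    target_gain_db: negative values lower the target voice in the mix (setup ii).
    Supplying both corresponds to setup iii.
    Returns the mixture and the scaled target (the separation reference).
    """
    n = len(target)
    if len(background) != n or (extra_voice is not None and len(extra_voice) != n):
        raise ValueError("all signals must have the same length")

    gain = db_to_gain(target_gain_db)
    scaled_target = [gain * s for s in target]

    mixture = []
    for i in range(n):
        sample = scaled_target[i] + background[i]
        if extra_voice is not None:
            sample += extra_voice[i]
        mixture.append(sample)
    return mixture, scaled_target


# Toy signals: short sine tones standing in for real audio.
sr = 8000
t = [i / sr for i in range(sr // 100)]
voice = [math.sin(2 * math.pi * 220 * x) for x in t]
other_voice = [math.sin(2 * math.pi * 330 * x) for x in t]
music = [0.5 * math.sin(2 * math.pi * 110 * x) for x in t]

# Setup iii: an overlapping voice plus a target lowered by an assumed 6 dB.
mix, reference = make_mixture(voice, music, extra_voice=other_voice, target_gain_db=-6.0)
```

A separation model would receive `mix` (together with the video of the target singer in the audio-visual case) and be scored against `reference`; the 6 dB attenuation here is an arbitrary illustrative value, not one reported in the abstract.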