SD · CV · LG · AS · Mar 8, 2022

VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer

arXiv:2203.04099v2 · 25 citations · h-index: 22
AI Analysis

This work addresses voice separation for applications that require real-time processing, but its contribution is incremental because it builds on existing audio-visual methods.

The paper tackles audio-visual voice separation for speech and singing voice, achieving state-of-the-art results with low latency by using a two-stage network combining graph convolutional networks and transformers.

This paper presents an audio-visual approach for voice separation which produces state-of-the-art results at a low latency in two scenarios: speech and singing voice. The model is based on a two-stage network. Motion cues are obtained with a lightweight graph convolutional network that processes face landmarks. Then, both audio and motion features are fed to an audio-visual transformer, which produces a fairly good estimate of the isolated target source. In a second stage, the predominant voice is enhanced with an audio-only network. We present several ablation studies and comparisons with state-of-the-art methods. Finally, we explore the transferability of models trained for speech separation to the task of singing voice separation. The demos, code, and weights are available at https://ipcv.github.io/VoViT/
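To make the first step of the abstract concrete, here is a minimal sketch of a single generic graph-convolution layer applied to face landmarks, computing ReLU(Â X W) with a symmetrically normalized adjacency matrix Â. It is an illustration under stated assumptions, not the paper's network: the four-landmark chain graph, the weight values, and the function names are all hypothetical, and the landmarks are a single static frame, whereas motion cues would come from landmark sequences over time.

```python
import math

def normalized_adjacency(num_nodes, edges):
    """Return D^-1/2 (A + I) D^-1/2 as nested lists for an undirected graph."""
    a = [[0.0] * num_nodes for _ in range(num_nodes)]
    for i in range(num_nodes):
        a[i][i] = 1.0  # self-loop so each node keeps its own features
    for i, j in edges:
        a[i][j] = 1.0
        a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_nodes)]
            for i in range(num_nodes)]

def matmul(x, y):
    """Plain nested-list matrix product."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))]
            for i in range(len(x))]

def graph_conv(features, adj_norm, weight):
    """One layer: average over graph neighbours, project channels, apply ReLU."""
    mixed = matmul(adj_norm, features)
    projected = matmul(mixed, weight)
    return [[max(0.0, v) for v in row] for row in projected]

# Hypothetical toy input: 4 landmarks with (x, y) coordinates on a chain graph.
landmarks = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
edges = [(0, 1), (1, 2), (2, 3)]
# Arbitrary illustrative weights mapping 2 input channels to 3 output channels.
weight = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]]

adj = normalized_adjacency(len(landmarks), edges)
landmark_features = graph_conv(landmarks, adj, weight)
```

In a pipeline like the one the abstract describes, per-frame features of this kind would be passed, together with audio features, to the audio-visual transformer; that stage and the audio-only refinement stage are not sketched here.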

Code Implementations: 1 repo