MoMuSE: Momentum Multi-modal Target Speaker Extraction for Real-time Scenarios with Impaired Visual Cues
This addresses the challenge of unstable speaker extraction in real-world applications with impaired visual cues, though it is incremental as it builds on existing audio-visual methods.
The paper tackles the problem of audio-visual target speaker extraction in real-time scenarios where visual cues are impaired, by introducing MoMuSE, which uses a momentum memory to track speakers, resulting in significant improvements, especially under severe visual impairments.
Audio-visual Target Speaker Extraction (AV-TSE) aims to isolate the speech of a specific target speaker from an audio mixture using time-synchronized visual cues. In real-world scenarios, visual cues are not always available due to various impairments, which undermines the stability of AV-TSE. Despite this challenge, humans can maintain attentional momentum over time, even when the target speaker is not visible. In this paper, we introduce the Momentum Multi-modal target Speaker Extraction (MoMuSE), which retains a speaker identity momentum in memory, enabling the model to continuously track the target speaker. Designed for real-time inference, MoMuSE extracts the current speech window with guidance from both visual cues and dynamically updated speaker momentum. Experimental results demonstrate that MoMuSE exhibits significant improvement, particularly in scenarios with severe impairment of visual cues.