SD | CV | MM | AS | Dec 11, 2024

MoMuSE: Momentum Multi-modal Target Speaker Extraction for Real-time Scenarios with Impaired Visual Cues

arXiv:2412.08247v2 | 3 citations | h-index: 9 | ICME
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of unstable speaker extraction in real-world applications with impaired visual cues, though the advance is incremental because it builds on existing audio-visual methods.

The paper tackles audio-visual target speaker extraction in real-time scenarios where visual cues are impaired. It introduces MoMuSE, which uses a momentum memory to keep tracking the target speaker, and reports significant improvements, especially under severe visual impairment.

Audio-visual Target Speaker Extraction (AV-TSE) aims to isolate the speech of a specific target speaker from an audio mixture using time-synchronized visual cues. In real-world scenarios, visual cues are not always available due to various impairments, which undermines the stability of AV-TSE. Despite this challenge, humans can maintain attentional momentum over time, even when the target speaker is not visible. In this paper, we introduce the Momentum Multi-modal target Speaker Extraction (MoMuSE), which retains a speaker identity momentum in memory, enabling the model to continuously track the target speaker. Designed for real-time inference, MoMuSE extracts the current speech window with guidance from both visual cues and dynamically updated speaker momentum. Experimental results demonstrate that MoMuSE exhibits significant improvement, particularly in scenarios with severe impairment of visual cues.
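The window-by-window idea in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not give the update rule, so the exponential moving average in `update_momentum`, the `alpha` parameter, and the `extract_window` callback signature are hypothetical and should not be read as the authors' implementation.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Vector = List[float]
# Hypothetical extractor signature: (mixture window, visual cue or None,
# current momentum or None) -> (extracted speech, speaker embedding).
Extractor = Callable[
    [Sequence[float], Optional[Sequence[float]], Optional[Vector]],
    Tuple[List[float], Vector],
]


def update_momentum(
    momentum: Optional[Vector], embedding: Sequence[float], alpha: float = 0.9
) -> Vector:
    """Blend a new speaker embedding into the stored momentum.

    Assumed rule (not from the paper): exponential moving average.
    """
    if momentum is None:
        # First window: initialise the memory with the first embedding.
        return list(embedding)
    return [alpha * m + (1.0 - alpha) * e for m, e in zip(momentum, embedding)]


def stream_extract(
    windows: Sequence[Sequence[float]],
    visual_cues: Sequence[Optional[Sequence[float]]],
    extract_window: Extractor,
    alpha: float = 0.9,
) -> Tuple[List[List[float]], Optional[Vector]]:
    """Process audio windows in order, carrying speaker momentum forward.

    A visual cue of None stands for an impaired or missing cue; the
    extractor can then fall back on the momentum alone.
    """
    momentum: Optional[Vector] = None
    outputs: List[List[float]] = []
    for mixture, visual in zip(windows, visual_cues):
        speech, speaker_embedding = extract_window(mixture, visual, momentum)
        outputs.append(speech)
        momentum = update_momentum(momentum, speaker_embedding, alpha)
    return outputs, momentum


def toy_extractor(mixture, visual, momentum):
    """Stand-in for a real model: passes audio through, fixed embedding."""
    return list(mixture), [1.0, 0.0]
```

In this sketch the memory is updated after every window, so a window whose visual cue is `None` still receives the speaker identity accumulated from earlier windows. A real system would replace `toy_extractor` with a trained audio-visual model.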

