AS · AI · SD · May 27, 2025

Plug-and-Play Co-Occurring Face Attention for Robust Audio-Visual Speaker Extraction

arXiv:2505.20635v1 · 2 citations · h-index: 10 · INTERSPEECH
Originality: Incremental advance
AI Analysis

This work addresses robust speaker extraction in real-world scenes where multiple faces appear on screen. The advance is incremental, since it builds on existing extraction models.

The paper tackles audio-visual speaker extraction in multi-person environments by introducing a plug-and-play attention module that processes co-occurring faces. It reports consistent improvements over baselines on diverse datasets, including VoxCeleb2 and MISP.

Audio-visual speaker extraction isolates a target speaker's speech from a mixture speech signal conditioned on a visual cue, typically using the target speaker's face recording. However, in real-world scenarios, other co-occurring faces are often present on-screen, providing valuable speaker activity cues in the scene. In this work, we introduce a plug-and-play inter-speaker attention module to process these flexible numbers of co-occurring faces, allowing for more accurate speaker extraction in complex multi-person environments. We integrate our module into two prominent models: the AV-DPRNN and the state-of-the-art AV-TFGridNet. Extensive experiments on diverse datasets, including the highly overlapped VoxCeleb2 and sparsely overlapped MISP, demonstrate that our approach consistently outperforms baselines. Furthermore, cross-dataset evaluations on LRS2 and LRS3 confirm the robustness and generalizability of our method.
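To make the idea of attending across a flexible number of co-occurring faces concrete, here is a minimal NumPy sketch. It is not the paper's module: the function name `inter_speaker_attention`, the use of unprojected features as queries, keys, and values, the residual connection, and the padding mask are all assumptions chosen for illustration. It assumes each face has already been encoded into per-frame embeddings and that unused face slots are zero-padded.

```python
import numpy as np


def inter_speaker_attention(feats, present):
    """Illustrative attention across speakers at each time frame.

    feats:   (S, T, D) array, one embedding sequence per face slot.
    present: (S,) boolean array, True where the slot holds a real face.
    Returns an (S, T, D) array; padded slots are zeroed.

    Hypothetical simplification: no learned projections, single head.
    """
    feats = np.asarray(feats, dtype=np.float64)
    present = np.asarray(present, dtype=bool)
    if not present.any():
        raise ValueError("at least one face slot must be present")

    num_slots, num_frames, dim = feats.shape

    # Put time first so attention runs over the speaker axis: (T, S, D).
    x = feats.transpose(1, 0, 2)

    # Scaled dot-product scores between every pair of slots: (T, S, S).
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(dim)

    # Padded slots must never be attended to, whatever they contain.
    scores = np.where(present[None, None, :], scores, -np.inf)

    # Numerically stable softmax over the key (speaker) axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # Mix information across speakers, then add a residual so the
    # block can sit inside an existing model without replacing features.
    attended = weights @ x
    out = x + attended

    # Zero the outputs of padded slots.
    out = out * present[None, :, None]
    return out.transpose(1, 0, 2)


# Example: two real faces and one padded slot, 4 frames, 8-dim embeddings.
rng = np.random.default_rng(0)
feats = rng.standard_normal((3, 4, 8))
present = np.array([True, True, False])
out = inter_speaker_attention(feats, present)
```

Because the mask removes padded slots from the softmax, the same function handles one, two, or more on-screen faces without changing shape conventions. That is one simple way to obtain the "flexible number of faces" property the abstract describes.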

