AS AI SD | May 27, 2025

Plug-and-Play Co-Occurring Face Attention for Robust Audio-Visual Speaker Extraction

Zexu Pan, Shengkui Zhao, Tingting Wang, Kun Zhou, Yukun Ma, Chong Zhang, Bin Ma

arXiv:2505.20635v13.32 citations | h-index: 10 | INTERSPEECH

Originality: Incremental advance

AI Analysis

This work addresses robust audio-visual speaker extraction in real-world scenes where multiple faces appear on screen. The contribution is incremental: it extends existing extraction models with an added module.

The paper tackles audio-visual speaker extraction in multi-person environments by introducing a plug-and-play inter-speaker attention module that processes co-occurring faces, yielding consistent improvements over the AV-DPRNN and AV-TFGridNet baselines on the highly overlapped VoxCeleb2 and sparsely overlapped MISP datasets.

Audio-visual speaker extraction isolates a target speaker's speech from a mixture speech signal conditioned on a visual cue, typically using the target speaker's face recording. However, in real-world scenarios, other co-occurring faces are often present on-screen, providing valuable speaker activity cues in the scene. In this work, we introduce a plug-and-play inter-speaker attention module to process these flexible numbers of co-occurring faces, allowing for more accurate speaker extraction in complex multi-person environments. We integrate our module into two prominent models: the AV-DPRNN and the state-of-the-art AV-TFGridNet. Extensive experiments on diverse datasets, including the highly overlapped VoxCeleb2 and sparsely overlapped MISP, demonstrate that our approach consistently outperforms baselines. Furthermore, cross-dataset evaluations on LRS2 and LRS3 confirm the robustness and generalizability of our method.
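To make the idea of attending across a flexible number of on-screen faces more concrete, here is a minimal sketch of attention applied along the speaker axis. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the module's internals, so the function name `inter_speaker_attention`, the identity query/key/value projections, the residual connection, and the `feats[speaker][frame]` layout are all hypothetical choices.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def inter_speaker_attention(feats):
    """Scaled dot-product attention along the speaker axis (illustrative only).

    feats[s][t] is a D-dimensional embedding (list of floats) for speaker s
    at frame t. Speaker 0 could be the target and the rest co-occurring faces.
    Assumption: queries, keys, and values are the raw embeddings; a real
    module would use learned projections.
    Returns a structure of the same shape as the input.
    """
    num_speakers = len(feats)
    num_frames = len(feats[0])
    dim = len(feats[0][0])
    scale = 1.0 / math.sqrt(dim)

    out = []
    for s in range(num_speakers):
        frames = []
        for t in range(num_frames):
            query = feats[s][t]
            # Keys and values are every speaker's embedding at the same frame.
            keys = [feats[k][t] for k in range(num_speakers)]
            scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
            weights = softmax(scores)
            mixed = [
                sum(w * key[d] for w, key in zip(weights, keys))
                for d in range(dim)
            ]
            # Residual connection (assumed): keep the speaker's own embedding.
            frames.append([q + m for q, m in zip(query, mixed)])
        out.append(frames)
    return out


# The same function handles one face or several, with no fixed speaker count.
single = inter_speaker_attention([[[1.0, 2.0], [0.5, -1.0]]])
triple = inter_speaker_attention([
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
])
```

Because the softmax runs over however many speakers are present at each frame, the input and output shapes match for any speaker count, which is one way a module of this kind could be dropped into an existing extraction network without changing its other layers.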
