SD | AI | CV | MM | AS | Jun 21, 2022

Rethinking Audio-visual Synchronization for Active Speaker Detection

arXiv:2206.10421v2 | 21 citations | h-index: 77
Originality: Incremental advance
AI Analysis

This work addresses a definitional inconsistency in active speaker detection for multi-talker conversation analysis, though the advance appears incremental, since it refines existing models.

The paper tackles active speaker detection by clarifying the definition to require audio-visual synchronization. It proposes a model that uses cross-modal contrastive learning and positional encoding and that, according to the reported experiments, classifies unsynchronized speaking as not speaking.

Active speaker detection (ASD) systems are important modules for analyzing multi-talker conversations. They aim to detect which speakers or none are talking in a visual scene at any given time. Existing research on ASD does not agree on the definition of active speakers. We clarify the definition in this work and require synchronization between the audio and visual speaking activities. This clarification of definition is motivated by our extensive experiments, through which we discover that existing ASD methods fail in modeling the audio-visual synchronization and often classify unsynchronized videos as active speaking. To address this problem, we propose a cross-modal contrastive learning strategy and apply positional encoding in attention modules for supervised ASD models to leverage the synchronization cue. Experimental results suggest that our model can successfully detect unsynchronized speaking as not speaking, addressing the limitation of current models.
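The abstract names two ingredients, a cross-modal contrastive objective and positional encoding in attention modules, without giving their exact form. The sketch below is only an illustration under assumptions: it uses an InfoNCE-style loss over cosine similarities and the standard sinusoidal positional encoding, neither of which is confirmed to be the authors' formulation. The function names (`info_nce`, `sinusoidal_positions`), the `temperature` value, and the toy embeddings are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def info_nce(audio, visual, temperature=0.1):
    """InfoNCE-style loss (an assumed stand-in for the paper's objective).

    audio[i] and visual[i] are treated as a synchronized (positive) pair;
    visual[j] with j != i act as negatives for audio[i].
    """
    losses = []
    for i, a in enumerate(audio):
        logits = [cosine(a, v) / temperature for v in visual]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the synchronized partner.
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


def sinusoidal_positions(num_frames, dim):
    """Standard sinusoidal positional encoding, one row per frame.

    Adding these rows to per-frame features gives attention layers access
    to temporal order, which a synchronization cue depends on.
    """
    table = []
    for t in range(num_frames):
        row = []
        for k in range(dim):
            angle = t / (10000 ** (2 * (k // 2) / dim))
            row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


# Toy embeddings: two time steps with orthogonal audio features.
audio = [[1.0, 0.0], [0.0, 1.0]]
visual_synced = [[1.0, 0.0], [0.0, 1.0]]   # aligned with the audio
visual_shifted = [[0.0, 1.0], [1.0, 0.0]]  # temporally swapped

loss_synced = info_nce(audio, visual_synced)
loss_shifted = info_nce(audio, visual_shifted)
positions = sinusoidal_positions(num_frames=4, dim=4)
```

In this toy setup the loss is small when the visual stream is aligned with the audio and large when it is shifted, which is the kind of signal a synchronization-aware model could exploit to label unsynchronized speech as not speaking.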

