CV · SD · AS · Sep 21, 2023

TalkNCE: Improving Active Speaker Detection with Talk-Aware Contrastive Learning

arXiv:2309.12306v1 · 16 citations · h-index: 36
Originality: Incremental advance
AI Analysis

This work addresses the problem of accurately detecting active speakers in videos for applications like video analysis and human-computer interaction, representing an incremental improvement over existing methods.

The paper tackles Active Speaker Detection by proposing TalkNCE, a talk-aware contrastive loss that improves model performance by learning effective representations from speech and facial movements, achieving state-of-the-art results on AVA-ActiveSpeaker and ASW datasets.

The goal of this work is Active Speaker Detection (ASD), a task to determine whether a person is speaking or not in a series of video frames. Previous works have dealt with the task by exploring network architectures while learning effective representations has been less explored. In this work, we propose TalkNCE, a novel talk-aware contrastive loss. The loss is only applied to part of the full segments where a person on the screen is actually speaking. This encourages the model to learn effective representations through the natural correspondence of speech and facial movements. Our loss can be jointly optimized with the existing objectives for training ASD models without the need for additional supervision or training data. The experiments demonstrate that our loss can be easily integrated into the existing ASD frameworks, improving their performance. Our method achieves state-of-the-art performances on AVA-ActiveSpeaker and ASW datasets.
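The idea of restricting a contrastive loss to speaking frames can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes an InfoNCE-style formulation in which, for each actively speaking frame, the time-aligned audio embedding is the positive and the audio embeddings of the other speaking frames in the same clip are the negatives. The function name `talk_aware_nce`, the single visual-to-audio direction, the cosine similarity, and the default `temperature` are all illustrative assumptions.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def talk_aware_nce(visual, audio, speaking, temperature=0.07):
    """InfoNCE-style loss computed only over frames labelled as speaking.

    visual, audio: lists of per-frame embedding vectors (same length T).
    speaking:      list of T binary labels (1 = the person is speaking).
    The formulation and the temperature default are assumptions for this sketch.
    """
    # Keep only the frames where the on-screen person is actually speaking.
    active = [t for t, is_speaking in enumerate(speaking) if is_speaking]
    if len(active) < 2:
        # No negatives are available, so the term contributes nothing.
        return 0.0

    v = [_normalize(visual[t]) for t in active]
    a = [_normalize(audio[t]) for t in active]

    total = 0.0
    for i in range(len(active)):
        # Cosine similarity of visual frame i with every active audio frame.
        logits = [_dot(v[i], a[j]) / temperature for j in range(len(active))]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability of the time-aligned (positive) pair.
        total += log_denominator - logits[i]
    return total / len(active)
```

Under the same assumptions, joint optimization with an existing ASD objective would amount to adding this term to the frame-level classification loss with a weighting factor, for example `total = asd_loss + weight * talk_aware_nce(visual, audio, speaking)`, where `asd_loss` and `weight` are hypothetical names. Because the speaking labels are the same ones already used for ASD training, no extra supervision is needed.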

Code Implementations: 1 repo