IV / CV / AS, Sep 24, 2022

Unsupervised active speaker detection in media content using cross-modal information

arXiv:2209.11896v1, 3 citations, h-index: 89
Originality: Incremental advance
AI Analysis

This paper addresses the problem of identifying active speakers in videos for media analysis. The advance is incremental because it builds on existing unsupervised and cross-modal techniques.

The authors tackle active speaker detection in media content by formulating it as a cross-modal speech-face assignment task, achieving performance competitive with state-of-the-art supervised methods on three benchmark datasets.

We present a cross-modal unsupervised framework for active speaker detection in media content such as TV shows and movies. Machine learning advances have enabled impressive performance in identifying individuals from speech and facial images. We leverage speaker identity information from speech and faces, and formulate active speaker detection as a speech-face assignment task such that the active speaker's face and the underlying speech identify the same person (character). We represent each speech segment by its speaker identity distances to all other speech segments, which captures a relative identity structure for the video. We then assign an active speaker's face to each speech segment, chosen from the concurrently appearing faces, such that the resulting set of active speaker faces displays a similar relative identity structure. Furthermore, we propose a simple and effective approach to address speech segments where speakers are present off-screen. We evaluate the proposed system on three benchmark datasets -- the Visual Person Clustering dataset, the AVA-active speaker dataset, and the Columbia dataset -- consisting of videos from entertainment and broadcast media, and show performance competitive with state-of-the-art fully supervised methods.
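The speech-face assignment idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`assign_faces`, `cosine_distance_matrix`), the use of cosine distance, the Pearson-correlation score, and the greedy coordinate-wise update are all assumptions chosen to keep the example short. It assumes each speech segment already has a speaker embedding and a non-empty set of face embeddings for the faces visible during that segment, and it does not handle off-screen speakers.

```python
import numpy as np


def cosine_distance_matrix(x):
    """Cosine distance between every pair of rows of x."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    return 1.0 - unit @ unit.T


def _correlation(a, b):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def assign_faces(speech_emb, face_candidates, n_sweeps=10):
    """Pick one candidate face per speech segment (hypothetical heuristic).

    speech_emb: (N, d_s) array, one speaker embedding per speech segment.
    face_candidates: list of N arrays, each (k_i, d_f), holding embeddings
        of the faces visible during segment i.
    Returns a list of N candidate indices.
    """
    n = len(face_candidates)
    choice = [0] * n  # start with the first visible face everywhere
    if n < 2:
        return choice
    speech_dist = cosine_distance_matrix(speech_emb)
    faces = [c / np.linalg.norm(c, axis=1, keepdims=True) for c in face_candidates]
    for _ in range(n_sweeps):
        changed = False
        for i in range(n):
            others = [j for j in range(n) if j != i]
            chosen = np.stack([faces[j][choice[j]] for j in others])
            # Distance from each candidate of segment i to the faces
            # currently assigned to all other segments: shape (k_i, N-1).
            face_dist = 1.0 - faces[i] @ chosen.T
            target = speech_dist[i, others]
            # Keep the candidate whose distance profile best matches the
            # speech-side distance profile of the same segment.
            scores = [_correlation(row, target) for row in face_dist]
            best = int(np.argmax(scores))
            if best != choice[i]:
                choice[i] = best
                changed = True
        if not changed:
            break
    return choice


# Toy example: two characters, A and B. Segments 0 and 1 are spoken by A,
# segments 2 and 3 by B. Segments 0 and 3 show a single face on screen.
speech = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
face_a = np.array([1.0, 0.0, 0.0])
face_b = np.array([0.0, 1.0, 0.0])
candidates = [
    np.array([face_a]),
    np.array([face_b, face_a]),
    np.array([face_a, face_b]),
    np.array([face_b]),
]
print(assign_faces(speech, candidates))  # prints [0, 1, 1, 0]
```

The score compares distance profiles rather than raw embeddings because speech and face embeddings live in different spaces; only their relative structure is comparable. In the toy example the single-face segments matter: if every segment showed both faces, swapping the two characters would give an equally consistent assignment.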

Code Implementations: 1 repo
