SD · CV · AS · IV · Jul 2, 2020

Spot the conversation: speaker diarisation in the wild

arXiv:2007.01216v3 · 192 citations
AI Analysis

This paper addresses the problem of labelling speaker turns in real-world videos. It is incremental in that it builds on existing diarisation techniques.

The paper tackles speaker diarisation in unconstrained videos by proposing an automatic audio-visual method and a semi-automatic dataset creation pipeline, resulting in the VoxConverse dataset with overlapping speech and diverse conditions.

The goal of this paper is speaker diarisation of videos collected 'in the wild'. We make three key contributions. First, we propose an automatic audio-visual diarisation method for YouTube videos. Our method consists of active speaker detection using audio-visual methods and speaker verification using self-enrolled speaker models. Second, we integrate our method into a semi-automatic dataset creation pipeline which significantly reduces the number of hours required to annotate videos with diarisation labels. Finally, we use this pipeline to create a large-scale diarisation dataset called VoxConverse, collected from 'in the wild' videos, which we will release publicly to the research community. Our dataset consists of overlapping speech, a large and diverse speaker pool, and challenging background conditions.
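
The speaker verification step with self-enrolled speaker models can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that each speech segment has already been turned into a fixed-length embedding, that active speaker detection has confidently labelled some segments with a visible speaker, and that the remaining segments are assigned by cosine similarity against the averaged embeddings. The function names, the averaging rule, and the threshold value are all hypothetical.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def enroll(confident_segments):
    """Build one model per speaker by averaging that speaker's embeddings.

    confident_segments maps a speaker label to the embeddings of segments
    where that speaker was confidently detected (an assumed input format).
    """
    models = {}
    for speaker, embeddings in confident_segments.items():
        dim = len(embeddings[0])
        models[speaker] = [
            sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)
        ]
    return models


def assign(embedding, models, threshold=0.5):
    """Return the best-matching enrolled speaker, or None if no model
    scores above the (hypothetical) similarity threshold."""
    best, best_score = None, threshold
    for speaker, model in models.items():
        score = cosine(embedding, model)
        if score > best_score:
            best, best_score = speaker, score
    return best


# Toy 2-D embeddings standing in for real speaker embeddings.
models = enroll({
    "spk_a": [[1.0, 0.0], [0.8, 0.2]],
    "spk_b": [[0.0, 1.0]],
})
label_seen = assign([1.0, 0.1], models)
label_unknown = assign([-1.0, -1.0], models)
```

In this sketch, a segment that matches no enrolled model is left unlabelled rather than forced onto the nearest speaker, which is one plausible way to handle off-screen voices.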

