CV · SD · AS · Oct 14, 2022

Intel Labs at Ego4D Challenge 2022: A Better Baseline for Audio-Visual Diarization

arXiv:2210.07764v3 · 17 citations · h-index: 11
Originality: Synthesis-oriented
AI Analysis

This work addresses audio-visual diarization for egocentric video analysis. It is incremental: it builds on the existing baselines and adds technical improvements to them.

The paper tackles the Audio-Visual Diarization task in the Ego4D Challenge 2022 by improving detection of the camera wearer's voice activity and by using an off-the-shelf voice activity detection model to reduce false positives. The final method reaches a 65.9% diarization error rate (DER) on the test set and took first place in the challenge.

This report describes our approach for the Audio-Visual Diarization (AVD) task of the Ego4D Challenge 2022. Specifically, we present multiple technical improvements over the official baselines. First, we improve the detection performance of the camera wearer's voice activity by modifying the training scheme of its model. Second, we discover that an off-the-shelf voice activity detection model can effectively remove false positives when it is applied solely to the camera wearer's voice activities. Lastly, we show that better active speaker detection leads to a better AVD outcome. Our final method obtains 65.9% DER on the test set of Ego4D, which significantly outperforms all the baselines. Our submission achieved 1st place in the Ego4D Challenge 2022.
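
The second improvement in the abstract, applying a generic voice activity detector only to the camera wearer's predicted speech, can be illustrated with a minimal sketch. It assumes both models emit frame-aligned outputs; the function names, the 0.5 threshold, and the frame length are hypothetical choices for illustration and are not taken from the report.

```python
import numpy as np


def filter_wearer_activity(wearer_scores, vad_mask, threshold=0.5):
    """Keep camera-wearer speech frames only where a generic VAD also fires.

    wearer_scores: per-frame scores from the camera-wearer model (assumed in [0, 1]).
    vad_mask: per-frame speech/non-speech decisions from an off-the-shelf VAD.
    threshold: illustrative decision threshold, not a value from the report.
    """
    wearer_scores = np.asarray(wearer_scores, dtype=float)
    vad_mask = np.asarray(vad_mask, dtype=bool)
    if wearer_scores.shape != vad_mask.shape:
        raise ValueError("inputs must be aligned frame by frame")
    wearer_pred = wearer_scores >= threshold
    # A frame survives only if both models agree that speech is present,
    # which removes wearer false positives in non-speech regions.
    return wearer_pred & vad_mask


def frames_to_segments(active, frame_sec=1.0):
    """Convert a boolean frame mask into (start, end) segments in seconds."""
    segments = []
    start = None
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i
        elif not is_active and start is not None:
            segments.append((start * frame_sec, i * frame_sec))
            start = None
    if start is not None:
        # Close a segment that runs to the end of the recording.
        segments.append((start * frame_sec, len(active) * frame_sec))
    return segments


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.2, 0.7, 0.6]
    vad = [1, 0, 0, 1, 1]
    kept = filter_wearer_activity(scores, vad)
    print(frames_to_segments(kept))
```

The filter can only remove frames, never add them, so it trades a possible increase in missed speech for fewer false alarms. That matches the abstract's claim that the off-the-shelf model is used to remove false positives.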
