ASSD Oct 23, 2020

The IDLAB VoxCeleb Speaker Recognition Challenge 2020 System Description

arXiv:2010.12468v1
56 citations
Originality: Incremental advance
AI Analysis

This work addresses speaker verification. It is an incremental advance that combines system fusion with large margin fine-tuning and quality-aware score calibration.

The authors tackled speaker verification in the VoxCeleb Speaker Recognition Challenge 2020. For the supervised tracks, they developed systems based on ECAPA-TDNN and Resnet34, applied large margin fine-tuning and quality-aware score calibration, and took first place. For the unsupervised track, they trained a system with contrastive learning and pseudo-labeling; it won the track and approached supervised performance.

In this technical report we describe the IDLAB top-scoring submissions for the VoxCeleb Speaker Recognition Challenge 2020 (VoxSRC-20) in the supervised and unsupervised speaker verification tracks. For the supervised verification tracks we trained 6 state-of-the-art ECAPA-TDNN systems and 4 Resnet34 based systems with architectural variations. On all models we apply a large margin fine-tuning strategy, which enables the training procedure to use higher margin penalties by using longer training utterances. In addition, we use quality-aware score calibration which introduces quality metrics in the calibration system to generate more consistent scores across varying levels of utterance conditions. A fusion of all systems with both enhancements applied led to the first place on the open and closed supervised verification tracks. The unsupervised system is trained through contrastive learning. Subsequent pseudo-label generation by iterative clustering of the training embeddings allows the use of supervised techniques. This procedure led to the winning submission on the unsupervised track, and its performance is closing in on supervised training.
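The effect of a larger margin penalty can be illustrated with a minimal sketch. It assumes an additive angular margin (AAM) softmax loss, which the abstract does not name. The scale factor, the margin values 0.2 and 0.5, and the example cosine similarities are illustrative assumptions, not values taken from the text above.

```python
import math


def aam_softmax_loss(cosines, target, margin, scale=30.0):
    """Additive angular margin softmax loss for a single example.

    cosines: cosine similarity between the embedding and each class weight.
    target:  index of the true speaker.
    margin:  angular penalty in radians added to the target angle.
    scale:   logit scale factor (assumed value, for illustration only).

    This sketch omits the special handling that practical implementations
    use when theta + margin exceeds pi.
    """
    logits = []
    for idx, cos_theta in enumerate(cosines):
        if idx == target:
            # Clamp to guard against rounding slightly outside [-1, 1].
            theta = math.acos(max(-1.0, min(1.0, cos_theta)))
            # Penalise the target class by enlarging its angle.
            cos_theta = math.cos(theta + margin)
        logits.append(scale * cos_theta)
    # Numerically stable log-sum-exp over all class logits.
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[target]


# Hypothetical similarities of one embedding to three speaker classes.
cosines = [0.7, 0.1, -0.2]
base_loss = aam_softmax_loss(cosines, target=0, margin=0.2)
large_margin_loss = aam_softmax_loss(cosines, target=0, margin=0.5)
```

For the same embedding, the larger margin yields a higher loss, so the network must pull same-speaker embeddings closer to their class centre to compensate. Under the assumptions above, this is the sense in which a higher margin penalty is a harder training objective, which the abstract says becomes usable with longer training utterances.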

