ASSD · Oct 22, 2020

Momentum Contrast Speaker Representation Learning

arXiv:2010.11457v1
2 citations
Originality: Incremental advance
AI Analysis

This work addresses speaker verification by enabling unsupervised learning, potentially reducing reliance on labeled data for open-set speaker recognition.

The authors tackled unsupervised speaker representation learning by proposing MoCoVox, a momentum contrast method applied to speech data, which outperformed state-of-the-art metric learning approaches by a large margin in speaker verification.

Unsupervised representation learning has made remarkable progress in narrowing the performance gap with supervised feature learning, especially in the image domain. In this study, to extend unsupervised learning to the speech domain, we propose Momentum Contrast for VoxCeleb (MoCoVox) as a learning mechanism. We pre-trained MoCoVox on VoxCeleb1 using instance discrimination. Applying MoCoVox to speaker verification revealed that it outperforms the state-of-the-art metric learning-based approach by a large margin. We also empirically demonstrate the features of contrastive learning in the speech domain by analyzing the distribution of the learned representations. Furthermore, we explored which pretext task is adequate for speaker verification. We expect that learning speaker representations without human supervision will help to address open-set speaker recognition.
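
The momentum contrast mechanism named in the abstract can be sketched as follows. This is a minimal NumPy illustration of the general momentum contrast recipe (a key encoder that tracks the query encoder as a moving average, a queue of negative keys, and an InfoNCE loss for instance discrimination), not the authors' implementation. The function names, the momentum value 0.999, and the temperature 0.07 are assumptions chosen for illustration; the abstract does not state the settings used for MoCoVox. The encoders themselves are omitted, so plain arrays stand in for utterance embeddings.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def momentum_update(query_params, key_params, m=0.999):
    """Move each key-encoder parameter slowly toward the query encoder.

    m is an assumed momentum coefficient, not a value from the abstract.
    """
    return [m * k + (1.0 - m) * q for q, k in zip(query_params, key_params)]

def info_nce_loss(q, k_pos, queue, temperature=0.07):
    """Instance-discrimination loss.

    q, k_pos: (batch, dim) embeddings of two views of the same utterance.
    queue:    (queue_size, dim) keys from earlier batches, used as negatives.
    """
    q, k_pos, queue = l2_normalize(q), l2_normalize(k_pos), l2_normalize(queue)
    l_pos = np.sum(q * k_pos, axis=1, keepdims=True)    # (batch, 1)
    l_neg = q @ queue.T                                 # (batch, queue_size)
    logits = np.concatenate([l_pos, l_neg], axis=1) / temperature
    logits = logits - logits.max(axis=1, keepdims=True) # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive key sits in column 0, so the target class is 0 for every row.
    return float(-log_prob[:, 0].mean())

def update_queue(queue, new_keys):
    """FIFO update: drop the oldest keys and append the newest ones.

    Assumes the batch is no larger than the queue.
    """
    return np.concatenate([queue[len(new_keys):], l2_normalize(new_keys)], axis=0)
```

In a training loop under these assumptions, each step would compute the loss for the current batch, update the query encoder by gradient descent, call `momentum_update` for the key encoder, and then push the new keys into the queue with `update_queue`.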

