ASSD Oct 22, 2020

Unsupervised Representation Learning for Speaker Recognition via Contrastive Equilibrium Learning

arXiv:2010.11433v1
24 citations
Originality: Incremental advance
AI Analysis

This paper addresses unsupervised speaker verification, offering an incremental improvement over existing unsupervised methods.

The paper tackles unsupervised speaker recognition by proposing Contrastive Equilibrium Learning (CEL), which combines a uniformity loss with a contrastive similarity loss to improve speaker embeddings. The best model reaches 8.01% and 4.01% EER on the VoxCeleb1 and VOiCES evaluation sets, respectively, which the authors report as state of the art among unsupervised speaker verification systems.

In this paper, we propose a simple but powerful unsupervised learning method for speaker recognition, namely Contrastive Equilibrium Learning (CEL), which increases the uncertainty on nuisance factors latent in the embeddings by employing the uniformity loss. To preserve speaker discriminability, a contrastive similarity loss function is used alongside it. Experimental results showed that the proposed CEL significantly outperforms the state-of-the-art unsupervised speaker verification systems, and the best performing model achieved 8.01% and 4.01% EER on the VoxCeleb1 and VOiCES evaluation sets, respectively. In addition, supervised speaker embedding networks initialized with parameters pre-trained via CEL performed better than those trained from randomly initialized parameters.
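
The abstract names the two loss terms but gives no formulas, so the following is only a minimal sketch of how such an objective might be combined, not the authors' implementation. It assumes the uniformity term is the commonly used log of the mean pairwise Gaussian potential on the unit hypersphere, and it substitutes an InfoNCE-style cross-entropy for the contrastive similarity term. All function names, the temperature, the scale `t`, and the `weight` parameter are hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Project each row onto the unit hypersphere."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def uniformity_loss(emb, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs.

    Lower values mean the embeddings are spread more evenly on the sphere.
    The scale t=2.0 is an assumed default, not a value from the paper.
    """
    z = l2_normalize(emb)
    sq_dists = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    upper = np.triu_indices(len(z), k=1)  # each unordered pair once
    return float(np.log(np.mean(np.exp(-t * sq_dists[upper]))))


def contrastive_similarity_loss(view_a, view_b, temperature=0.1):
    """InfoNCE-style stand-in: row i of view_a should match row i of view_b.

    The two views are assumed to be embeddings of two segments (or
    augmentations) of the same utterance, so matching rows share a speaker.
    """
    a, b = l2_normalize(view_a), l2_normalize(view_b)
    logits = a @ b.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


def cel_style_loss(view_a, view_b, weight=1.0):
    """Contrastive term plus weighted uniformity term (assumed combination)."""
    uni = 0.5 * (uniformity_loss(view_a) + uniformity_loss(view_b))
    return contrastive_similarity_loss(view_a, view_b) + weight * uni
```

In this sketch the uniformity term is zero when all embeddings collapse to one point and becomes negative as they spread out, while the contrastive term is small only when matching rows of the two views are more similar to each other than to other rows.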

Code Implementations: 1 repo
