AS LG SD
Jul 23, 2025

Clustering-based hard negative sampling for supervised contrastive speaker verification

Piotr Masztalski, Michał Romaniuk, Jakub Żak, Mateusz Matuszewski, Konrad Kowalczyk

arXiv:2507.17540v1
2.31 citations

Originality: Incremental advance

AI Analysis

This work aims to improve speaker verification accuracy for applications such as biometrics. The advance is incremental, since it builds on existing contrastive learning methods.

The paper addresses how to use hard negative pairs effectively in supervised contrastive learning for speaker verification. It proposes CHNS, a clustering-based sampling method that adjusts batch composition to optimize the ratio of hard and easy negatives, and reports up to 18% relative improvement in EER and minDCF on the VoxCeleb dataset.

In speaker verification, contrastive learning is gaining popularity as an alternative to the traditionally used classification-based approaches. Contrastive methods can benefit from an effective use of hard negative pairs, which are different-class samples that are particularly challenging for a verification model due to their similarity. In this paper, we propose CHNS, a clustering-based hard negative sampling method designed for supervised contrastive speaker representation learning. Our approach clusters embeddings of similar speakers and adjusts batch composition to obtain an optimal ratio of hard and easy negatives during contrastive loss calculation. Experimental evaluation shows that CHNS outperforms a baseline supervised contrastive approach, both with and without loss-based hard negative sampling, as well as a state-of-the-art classification-based approach to speaker verification, by as much as 18% relative in EER and minDCF on the VoxCeleb dataset using two lightweight model architectures.
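The abstract does not say which clustering algorithm is used, what ratio of hard to easy negatives is chosen, or how batches are assembled. The following is only a minimal sketch of the general idea, under several assumptions: each speaker is represented by a single mean embedding, clustering is plain k-means, and speakers that share a cluster are treated as hard negatives for one another. The function names and the `hard_ratio` parameter are hypothetical and are not taken from the paper.

```python
import numpy as np


def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per row of x."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance from every point to every center.
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = x[labels == c]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[c] = members.mean(axis=0)
    return labels


def sample_batch_speakers(labels, batch_speakers, hard_ratio, rng):
    """Pick speaker indices for one batch.

    Assumption: speakers drawn from the same cluster act as hard negatives
    for each other; speakers drawn from other clusters act as easy negatives.
    """
    anchor_cluster = rng.choice(np.unique(labels))
    inside = np.flatnonzero(labels == anchor_cluster)
    outside = np.flatnonzero(labels != anchor_cluster)
    n_hard = min(int(round(hard_ratio * batch_speakers)), len(inside))
    n_easy = min(batch_speakers - n_hard, len(outside))
    hard = rng.choice(inside, size=n_hard, replace=False)
    easy = rng.choice(outside, size=n_easy, replace=False)
    return np.concatenate([hard, easy]), anchor_cluster


# Toy usage: 40 hypothetical speakers with 8-dimensional mean embeddings.
rng = np.random.default_rng(0)
speaker_embeddings = rng.normal(size=(40, 8))
cluster_labels = kmeans(speaker_embeddings, k=4)
batch, cluster = sample_batch_speakers(cluster_labels, 8, 0.5, rng)
```

In an actual training loop, the returned speaker indices would determine which speakers' utterances fill the batch before the supervised contrastive loss is computed, and the clusters would presumably be refreshed as the embeddings change during training. Both of those steps are assumptions here as well.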
