Contrastive-mixup learning for improved speaker verification
This work addresses speaker verification, a domain-specific problem in speech processing, and improves performance when training data is limited. The contribution is incremental, since it builds on existing mixup and metric learning techniques.
The paper tackles speaker verification with limited training data by proposing contrastive-mixup, a novel augmentation strategy that combines mixup with prototypical loss for metric learning, resulting in a 16% relative reduction in error rate on the VoxCeleb database.
This paper proposes a novel formulation of prototypical loss with mixup for speaker verification. Mixup is a simple yet efficient data augmentation technique that constructs weighted combinations of randomly selected pairs of data points and their labels for deep neural network training. Mixup has attracted increasing attention due to its ability to improve the robustness and generalization of deep neural networks. Although mixup has shown success in diverse domains, most applications have centered on closed-set classification tasks. In this work, we propose contrastive-mixup, a novel augmentation strategy that learns distinguishing representations based on a distance metric. During training, mixup operations generate convex interpolations of both inputs and virtual labels. Moreover, we reformulate the prototypical loss function so that mixup can be applied to metric learning objectives. To demonstrate its generalization given limited training data, we conduct experiments by varying the number of available utterances from each speaker in the VoxCeleb database. Experimental results show that contrastive-mixup outperforms the existing baseline, with a 16% relative reduction in error rate, especially when the number of training utterances per speaker is limited.
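The abstract does not give the exact loss, so the following is only a minimal sketch of how mixup might be combined with a prototypical objective. It assumes squared Euclidean distance to class prototypes, a softmax over negative distances, and a loss interpolated with the same weight `lam` used to mix the inputs. The function names (`mixup_pair`, `mixup_prototypical_loss`) and the toy linear encoder are hypothetical and are not taken from the paper.

```python
import numpy as np

def mixup_pair(x_a, x_b, lam):
    """Convex interpolation of two inputs with mixing weight lam in [0, 1]."""
    return lam * x_a + (1.0 - lam) * x_b

def log_softmax(z):
    """Numerically stable row-wise log-softmax."""
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def mixup_prototypical_loss(query_emb, prototypes, labels_a, labels_b, lam):
    """Prototypical loss with interpolated (virtual) labels.

    query_emb : (N, D) embeddings of mixed query inputs
    prototypes: (C, D) one prototype (e.g. mean support embedding) per speaker
    labels_a/b: (N,) speaker indices of the two utterances that were mixed
    Assumption: the loss is the lam-weighted sum of the two per-label losses.
    """
    # Squared Euclidean distance from every query to every prototype: (N, C)
    diff = query_emb[:, None, :] - prototypes[None, :, :]
    sq_dist = (diff ** 2).sum(axis=2)
    # Closer prototypes get higher probability
    log_p = log_softmax(-sq_dist)
    idx = np.arange(len(query_emb))
    loss_a = -log_p[idx, labels_a]
    loss_b = -log_p[idx, labels_b]
    return float((lam * loss_a + (1.0 - lam) * loss_b).mean())

# Toy usage with a hypothetical linear "encoder" W (stand-in for a speaker network)
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))            # maps 8-dim features to 4-dim embeddings
x_a = rng.normal(size=(5, 8))          # utterances from speakers labels_a
x_b = rng.normal(size=(5, 8))          # utterances from speakers labels_b
labels_a = np.array([0, 1, 2, 0, 1])
labels_b = np.array([2, 0, 1, 1, 2])
prototypes = rng.normal(size=(3, 4))   # one prototype per speaker
lam = rng.beta(0.5, 0.5)               # mixing weight; Beta parameters are an assumption
mixed = mixup_pair(x_a, x_b, lam)
loss = mixup_prototypical_loss(mixed @ W, prototypes, labels_a, labels_b, lam)
```

With `lam = 1` the sketch reduces to the ordinary prototypical loss on the first utterance, which is one way to check that the interpolated objective is a strict generalization of the unmixed one.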