AS · SD · Mar 29, 2021

Improved Meta-Learning Training for Speaker Verification

arXiv:2103.15421v23.32 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses speaker verification, a domain-specific task, with incremental improvements to meta-learning training methods.

The authors improve meta-learning training for speaker verification with two methods: joint training followed by learned transformation coefficients, and random erasing augmentation paired with a contrastive loss. Each method yields consistent improvements over existing meta-learning frameworks on the SITW and VOiCES databases, and combining the two improves results further.

Meta-learning has recently become a research hotspot in speaker verification (SV). We introduce two methods to improve the meta-learning training for SV in this paper. For the first method, a backbone embedding network is first jointly trained with the conventional cross entropy loss and prototypical networks (PN) loss. Then, inspired by speaker adaptive training in speech recognition, additional transformation coefficients are trained with only the PN loss. The transformation coefficients are used to modify the original backbone embedding network in the x-vector extraction process. Furthermore, the random erasing data augmentation technique is applied to all support samples in each episode to construct positive pairs, and a contrastive loss between the augmented and the original support samples is added to the objective in model training. Experiments are carried out on the SITW and VOiCES databases. Both of the methods can obtain consistent improvements over existing meta-learning training frameworks. By combining these two methods, we can observe further improvements on these two databases.
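The loss terms named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the abstract does not give the form of the contrastive loss, the erasing parameters, or the loss weights, so the InfoNCE-style loss over cosine similarities, `max_frac`, `temperature`, `w_pn`, and `w_con` below are all assumptions, and every function name is hypothetical. The transformation coefficients are left out because the abstract does not specify their form.

```python
import math
import random


def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - lse for s in scores]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def prototypes(support):
    """Mean embedding per speaker; support maps speaker -> list of embeddings."""
    protos = {}
    for spk, embs in support.items():
        dim = len(embs[0])
        protos[spk] = [sum(e[d] for e in embs) / len(embs) for d in range(dim)]
    return protos


def pn_loss(support, queries):
    """Prototypical networks loss: mean negative log-probability of each
    query's true speaker, with logits = -squared distance to prototypes."""
    protos = prototypes(support)
    speakers = sorted(protos)
    total = 0.0
    for emb, spk in queries:
        logp = log_softmax([-sq_dist(emb, protos[s]) for s in speakers])
        total -= logp[speakers.index(spk)]
    return total / len(queries)


def random_erase(features, max_frac=0.3, rng=random):
    """Return a copy of a (frames x bins) feature matrix with one random
    rectangle set to zero. max_frac is an assumed hyperparameter."""
    n_t, n_f = len(features), len(features[0])
    h = rng.randint(1, max(1, int(n_t * max_frac)))
    w = rng.randint(1, max(1, int(n_f * max_frac)))
    t0 = rng.randint(0, n_t - h)
    f0 = rng.randint(0, n_f - w)
    out = [row[:] for row in features]
    for t in range(t0, t0 + h):
        for f in range(f0, f0 + w):
            out[t][f] = 0.0
    return out


def cosine(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)


def contrastive_loss(orig, aug, temperature=0.1):
    """Assumed InfoNCE-style loss: orig[i] and aug[i] form the positive pair,
    and every other aug[j] acts as a negative."""
    total = 0.0
    for i, o in enumerate(orig):
        logp = log_softmax([cosine(o, a) / temperature for a in aug])
        total -= logp[i]
    return total / len(orig)


def joint_objective(ce, pn, con, w_pn=1.0, w_con=1.0):
    """Weighted sum of the three loss terms; the weights are assumptions."""
    return ce + w_pn * pn + w_con * con
```

In an episode, `random_erase` would be applied to each support sample's acoustic features before embedding extraction, and `contrastive_loss` would then compare the embeddings of the original and erased versions. A query placed at its own speaker's prototype gives a `pn_loss` near zero, while a mislabeled query gives a large one.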
