Feature Enhancement with Deep Feature Losses for Speaker Verification
This work addresses the challenge of speaker verification in noisy, real-world environments, particularly for applications like children's recordings, and represents an incremental improvement over existing methods.
The paper tackles speaker verification's poor generalization to adverse environments by proposing a feature-domain supervised denoising method trained with a Deep Feature Loss. It reports consistent gains over a state-of-the-art augmented Factorized-TDNN x-vector system, including relative improvements of 10.38% in minDCF and 12.40% in EER on the BabyTrain corpus.
Speaker verification still generalizes poorly to novel adverse environments. We leverage recent advances in deep-learning-based speech enhancement and propose a solution based on supervised denoising in the feature domain. We propose to use a Deep Feature Loss, which optimizes the enhancement network in the hidden activation space of a pre-trained auxiliary speaker embedding network. We verify the approach experimentally on simulated and real data. The simulated test setup is created using various noise types at different SNR levels. For evaluation on real data, we choose the BabyTrain corpus, which consists of children's recordings in uncontrolled environments. We observe consistent gains in every condition over the state-of-the-art augmented Factorized-TDNN x-vector system. On the BabyTrain corpus, we observe relative gains of 10.38% and 12.40% in minDCF and EER, respectively.
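To make the idea of a Deep Feature Loss concrete, the following is a minimal sketch and not the authors' implementation. It compares an enhanced feature frame with its clean reference inside the hidden layers of a frozen auxiliary network, rather than directly in the feature domain. Everything specific here is an assumption made for illustration: the tiny two-layer ReLU network with random weights stands in for the pre-trained speaker embedding network, the distance is a mean absolute difference, all layers are weighted equally, and the names `hidden_activations` and `deep_feature_loss` are hypothetical.

```python
import random


def relu(vector):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in vector]


def linear(weights, vector):
    """Multiply a weight matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def hidden_activations(layers, frame):
    """Return the activation after every layer of the frozen auxiliary network."""
    activations = []
    hidden = frame
    for weights in layers:
        hidden = relu(linear(weights, hidden))
        activations.append(hidden)
    return activations


def deep_feature_loss(layers, enhanced, clean):
    """Sum over layers of the mean absolute difference between the activations
    produced by the enhanced frame and by the clean frame.
    Assumption: equal layer weights and an L1-style distance."""
    loss = 0.0
    enhanced_acts = hidden_activations(layers, enhanced)
    clean_acts = hidden_activations(layers, clean)
    for act_enh, act_cln in zip(enhanced_acts, clean_acts):
        loss += sum(abs(e - c) for e, c in zip(act_enh, act_cln)) / len(act_cln)
    return loss


# Hypothetical stand-in for the pre-trained auxiliary network: two random layers.
rng = random.Random(0)
feat_dim, hidden_dim = 4, 3
layers = [
    [[rng.uniform(-1, 1) for _ in range(feat_dim)] for _ in range(hidden_dim)],
    [[rng.uniform(-1, 1) for _ in range(hidden_dim)] for _ in range(hidden_dim)],
]

clean = [0.5, -0.2, 0.1, 0.9]                    # toy clean feature frame
noisy = [c + rng.gauss(0, 0.3) for c in clean]   # toy stand-in for an imperfectly enhanced frame

loss_identical = deep_feature_loss(layers, clean, clean)
loss_noisy = deep_feature_loss(layers, noisy, clean)
print(loss_identical, loss_noisy)
```

In a real training setup, the enhancement network's output would take the place of `noisy`, the auxiliary network's weights would stay fixed, and only the enhancement network would be updated to reduce this loss.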