SD LG AS | Sep 30, 2021

Fine-tuning wav2vec2 for speaker recognition

arXiv:2109.15053v222.8130 citations | Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses speaker recognition, which is used in applications such as security and personalization. Its contribution is incremental: it adapts an existing framework to a new task, and its results approach but do not surpass the ECAPA-TDNN baseline.

This paper tackles speaker recognition by adapting the wav2vec2 framework, originally developed for speech recognition, to the task. Its best variant achieves a 1.88% EER on the extended voxceleb1 test set, which is close to, but higher than, the 1.69% EER of the ECAPA-TDNN baseline.

This paper explores applying the wav2vec2 framework to speaker recognition instead of speech recognition. We study the effectiveness of the pre-trained weights on the speaker recognition task, and how to pool the wav2vec2 output sequence into a fixed-length speaker embedding. To adapt the framework to speaker recognition, we propose a single-utterance classification variant with CE or AAM softmax loss, and an utterance-pair classification variant with BCE loss. Our best performing variant, w2v2-aam, achieves a 1.88% EER on the extended voxceleb1 test set compared to 1.69% EER with an ECAPA-TDNN baseline. Code is available at https://github.com/nikvaessen/w2v2-speaker.
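The two ideas named in the abstract, pooling a variable-length output sequence into a fixed-length embedding and training with an AAM softmax loss, can be illustrated with a minimal sketch. This is not the authors' implementation: plain Python lists stand in for the wav2vec2 output frames, mean pooling is only one possible pooling choice (the abstract does not say which one performs best), and the function names and the `margin` and `scale` values are assumptions chosen for illustration.

```python
import math


def mean_pool(frames):
    """Average a sequence of frame vectors (T x D) into a single D-dim vector."""
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def aam_softmax_loss(embedding, class_weights, target, margin=0.2, scale=30.0):
    """Additive angular margin softmax loss for one embedding.

    margin and scale are illustrative assumptions, not values from the paper.
    Simplification: no special handling when angle + margin exceeds pi.
    """
    e = l2_normalize(embedding)
    logits = []
    for k, w in enumerate(class_weights):
        # Cosine similarity between the embedding and the class weight vector.
        cos = sum(a * b for a, b in zip(e, l2_normalize(w)))
        cos = max(-1.0, min(1.0, cos))
        if k == target:
            # Penalize the target class by widening its angle by the margin.
            cos = math.cos(math.acos(cos) + margin)
        logits.append(scale * cos)
    # Numerically stable cross-entropy over the scaled cosine logits.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


# Two utterances of different lengths map to embeddings of the same size.
short_utt = [[1.0, 2.0], [3.0, 4.0]]
long_utt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
emb_short = mean_pool(short_utt)
emb_long = mean_pool(long_utt)

# Hypothetical weight vectors for two speakers.
speakers = [[1.0, 0.0], [0.0, 1.0]]
loss = aam_softmax_loss(emb_long, speakers, target=0)
```

With `margin=0.0` the function reduces to an ordinary softmax cross-entropy over scaled cosine similarities. A positive margin makes the target class harder to satisfy, which pushes embeddings of the same speaker closer together in angle.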
