CL SD AS

Apr 27, 2022

Why does Self-Supervised Learning for Speech Recognition Benefit Speaker Recognition?

Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Zhuo Chen, Peidong Wang, Gang Liu, Jinyu Li, Jian Wu, Xiangzhan Yu, Furu Wei

Microsoft

arXiv:2204.12765v2

5.653 citations

h-index: 102

Originality: Incremental advance

AI Analysis

This work addresses the problem of understanding SSL mechanisms for speaker recognition, providing insights for researchers in speech and speaker processing, but it is incremental as it builds on existing SSL methods.

The paper investigates why self-supervised learning (SSL) designed for speech recognition improves speaker verification. Experiments on the Voxceleb-1 dataset indicate that the mask speech prediction loss, data scale, and model size are the key factors, while the SSL quantizer has only a minor impact.

Recently, self-supervised learning (SSL) has demonstrated strong performance in speaker recognition, even when the pre-training objective is designed for speech recognition. In this paper, we study which factors lead to the success of self-supervised learning on speaker-related tasks, e.g. speaker verification (SV), through a series of carefully designed experiments. Our empirical results on the Voxceleb-1 dataset suggest that the benefit of SSL to the SV task comes from a combination of mask speech prediction loss, data scale, and model size, while the SSL quantizer has a minor impact. We further employ the integrated gradients attribution method and loss landscape visualization to understand the effectiveness of self-supervised learning for speaker recognition performance.
