Why does Self-Supervised Learning for Speech Recognition Benefit Speaker Recognition?
This work examines why self-supervised learning (SSL) helps speaker recognition, offering insights for researchers in speech and speaker processing. Its contribution is incremental, because it builds on existing SSL methods.
The paper investigates why self-supervised learning (SSL) designed for speech recognition improves speaker verification. Based on experiments on the VoxCeleb1 dataset, it finds that the masked speech prediction loss, data scale, and model size are the key factors, while the SSL quantizer has only a minor impact.
Recently, self-supervised learning (SSL) has demonstrated strong performance in speaker recognition, even when the pre-training objective is designed for speech recognition. In this paper, we study which factors lead to the success of self-supervised learning on speaker-related tasks, e.g., speaker verification (SV), through a series of carefully designed experiments. Our empirical results on the VoxCeleb1 dataset suggest that the benefit of SSL to the SV task comes from a combination of the masked speech prediction loss, data scale, and model size, while the SSL quantizer has a minor impact. We further employ the integrated gradients attribution method and loss landscape visualization to understand the effectiveness of self-supervised learning for speaker recognition.
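
For readers unfamiliar with the integrated gradients attribution method mentioned above, the following is a minimal sketch of the general technique, not the paper's implementation. It approximates the path integral of gradients from a baseline input to the actual input with a midpoint Riemann sum. The toy model, its analytic gradient, the all-zero baseline, and every function name here are assumptions chosen for illustration; a real speaker model would obtain gradients through automatic differentiation.

```python
from typing import Callable, List


def integrated_gradients(
    grad_fn: Callable[[List[float]], List[float]],
    x: List[float],
    baseline: List[float],
    steps: int = 50,
) -> List[float]:
    """Approximate integrated gradients with a midpoint Riemann sum.

    grad_fn returns the gradient of the model output with respect to
    its input, evaluated at the given point.
    """
    total = [0.0] * len(x)
    for k in range(steps):
        # Midpoint of the k-th sub-interval of the path parameter in [0, 1].
        alpha = (k + 0.5) / steps
        # Point on the straight line from the baseline to the input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = grad_fn(point)
        for i, g in enumerate(grad):
            total[i] += g
    # Scale the averaged gradient by the input's offset from the baseline.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


# Hypothetical toy model (an assumption): f(x) = x0^2 + 3 * x1.
def toy_model(x: List[float]) -> float:
    return x[0] ** 2 + 3.0 * x[1]


def toy_grad(x: List[float]) -> List[float]:
    return [2.0 * x[0], 3.0]


x = [2.0, 1.0]
baseline = [0.0, 0.0]  # assumed all-zero reference input
attributions = integrated_gradients(toy_grad, x, baseline)
```

A useful sanity check is the completeness property: the attributions should sum, up to numerical error, to the difference between the model output at the input and at the baseline, here `toy_model(x) - toy_model(baseline)`.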