Self-supervised Fine-tuning for Improved Content Representations by Speaker-invariant Clustering
This work improves the content representations of self-supervised speech models for tasks such as speech recognition and acoustic unit discovery; it is an incremental improvement over prior methods.
The paper addresses the challenge of improving self-supervised speech representation models for content-related tasks. It proposes speaker-invariant clustering (Spin), a method that disentangles speaker information while preserving content representations with only 45 minutes of fine-tuning on a single GPU, and that outperforms prior methods in speech recognition and acoustic unit discovery.
Self-supervised speech representation models have succeeded in various tasks, but improving them for content-related problems using unlabeled data is challenging. We propose speaker-invariant clustering (Spin), a novel self-supervised learning method that clusters speech representations and performs swapped prediction between the original and speaker-perturbed utterances. Spin disentangles speaker information and preserves content representations with just 45 minutes of fine-tuning on a single GPU. Spin improves pre-trained networks and outperforms prior methods in speech recognition and acoustic unit discovery.
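The swapped-prediction idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that frame-level features for the original and speaker-perturbed utterances are already available and time-aligned, that cluster assignments come from cosine similarity to a codebook followed by a softmax, and that a sharpened softmax is an acceptable stand-in for the target distribution. The function names, temperatures, and toy vectors are all hypothetical.

```python
import math


def softmax(scores, temperature):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-8)


def cluster_probs(frames, codebook, temperature):
    """Per-frame distribution over codebook entries (assumed form)."""
    return [softmax([cosine(f, c) for c in codebook], temperature)
            for f in frames]


def cross_entropy(targets, preds):
    """Mean cross-entropy between target and predicted distributions."""
    total = 0.0
    for t, p in zip(targets, preds):
        total -= sum(ti * math.log(pi + 1e-12) for ti, pi in zip(t, p))
    return total / len(targets)


def swapped_prediction_loss(orig_frames, pert_frames, codebook,
                            pred_temp=0.1, target_temp=0.05):
    """Each view predicts the cluster targets computed from the other view.

    Simplification: targets are sharpened softmax outputs. In a real
    training loop they would be treated as constants (no gradient).
    """
    p_orig = cluster_probs(orig_frames, codebook, pred_temp)
    p_pert = cluster_probs(pert_frames, codebook, pred_temp)
    q_orig = cluster_probs(orig_frames, codebook, target_temp)
    q_pert = cluster_probs(pert_frames, codebook, target_temp)
    return 0.5 * (cross_entropy(q_pert, p_orig)
                  + cross_entropy(q_orig, p_pert))


if __name__ == "__main__":
    # Toy 2-D example: two codes, two frames per utterance.
    codebook = [[1.0, 0.0], [0.0, 1.0]]
    original = [[1.0, 0.0], [0.0, 1.0]]
    perturbed = [[0.9, 0.1], [0.1, 0.9]]  # same content, shifted slightly
    print(swapped_prediction_loss(original, perturbed, codebook))
```

When both views of a frame fall into the same cluster the loss is small, and it grows when the perturbed view lands in a different cluster. Minimizing it therefore pushes the representation to ignore whatever the speaker perturbation changes.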