CL · Jun 9, 2024

MS-HuBERT: Mitigating Pre-training and Inference Mismatch in Masked Language Modelling methods for learning Speech Representations

Hemant Yadav, Sunayana Sitaram, Rajiv Ratn Shah

arXiv:2406.05661v4 · 2.75 citations

Originality: Incremental advance

AI Analysis

This work improves self-supervised learning for speech recognition, offering incremental gains for researchers and practitioners in ASR.

The paper tackles the performance gap between HuBERT and data2vec in speech representation learning by proposing MS-HuBERT, which addresses the pre-training and inference mismatch and uses a multicluster loss, yielding a 5% average improvement over vanilla HuBERT on the LibriSpeech ASR benchmark.

In recent years, self-supervised pre-training methods have gained significant traction in learning high-level information from raw speech. Among these methods, HuBERT has demonstrated state-of-the-art performance in automatic speech recognition (ASR). However, HuBERT's performance lags behind data2vec due to disparities in pre-training strategies. In this paper, we propose (i) a Swap method to address the pre-training and inference mismatch observed in HuBERT and (ii) a Multicluster masked prediction loss for more effective utilization of the model's capacity. The resulting method, MS-HuBERT, is an end-to-end self-supervised pre-training method for learning robust speech representations. It beats vanilla HuBERT on the LibriSpeech ASR benchmark by a 5% margin on average when evaluated on different fine-tuning splits. Additionally, we demonstrate that the learned embeddings obtained during pre-training encode essential information for improving the performance of content-based tasks such as ASR.
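
The abstract does not spell out the Multicluster masked prediction loss, so the following is only a rough sketch under stated assumptions: that each frame carries targets from several cluster label sets (for example, k-means clusterings of different sizes), that the model has one prediction head per label set, and that the loss is the cross-entropy at masked frames averaged over heads. The function name, argument layout, and the averaging scheme are hypothetical and are not taken from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of scores."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def masked_multicluster_loss(logits_per_head, targets_per_head, mask):
    """Hypothetical multi-head masked prediction loss (illustrative only).

    logits_per_head[h][t]  : scores over the cluster IDs of label set h at frame t
    targets_per_head[h][t] : target cluster ID of label set h at frame t
    mask[t]                : True if frame t was masked during pre-training
    """
    masked_frames = [t for t, is_masked in enumerate(mask) if is_masked]
    if not masked_frames:
        return 0.0

    head_losses = []
    for logits, targets in zip(logits_per_head, targets_per_head):
        # Cross-entropy is computed only at masked frames, as in masked prediction.
        nll = [-log_softmax(logits[t])[targets[t]] for t in masked_frames]
        head_losses.append(sum(nll) / len(nll))

    # Assumption: heads are weighted equally.
    return sum(head_losses) / len(head_losses)


# Two label sets (4 and 2 clusters), three frames, frames 0 and 2 masked.
logits = [[[0.0] * 4 for _ in range(3)], [[0.0] * 2 for _ in range(3)]]
targets = [[0, 1, 2], [1, 0, 1]]
loss = masked_multicluster_loss(logits, targets, [True, False, True])
```

With uniform scores, each head's loss reduces to the log of its cluster count, which gives a quick sanity check on the averaging.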
