CL · Apr 19, 2024

Efficient infusion of self-supervised representations in Automatic Speech Recognition

arXiv:2404.12628v1 · h-index: 4
Originality: Incremental advance
AI Analysis

This work addresses the computational cost of using SSL models in ASR, a concern for both researchers and practitioners. It is an incremental advance that builds on existing SSL integration methods.

The paper tackles the problem of efficiently integrating self-supervised learned (SSL) models such as Wav2vec and HuBERT into automatic speech recognition (ASR) systems, where using them as a trainable encoder or learnable frontend makes training slow and computationally expensive. It proposes two simple methods, framewise addition and cross-attention, that yield significant performance gains on the Librispeech and Tedlium datasets while keeping model sizes comparable to standard systems.

Self-supervised learned (SSL) models such as Wav2vec and HuBERT yield state-of-the-art results on speech-related tasks. Given the effectiveness of such models, it is advantageous to use them in conventional ASR systems. While some approaches suggest incorporating these models as a trainable encoder or a learnable frontend, training such systems is extremely slow and requires a lot of computation cycles. In this work, we propose two simple approaches that use (1) framewise addition and (2) cross-attention mechanisms to efficiently incorporate the representations from the SSL model(s) into the ASR architecture, resulting in models that are comparable in size with standard encoder-decoder conformer systems while also avoiding the usage of SSL models during training. Our approach results in faster training and yields significant performance gains on the Librispeech and Tedlium datasets compared to baselines. We further provide detailed analysis and ablation studies that demonstrate the effectiveness of our approach.
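
To make the two fusion mechanisms named in the abstract concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation. The abstract does not say where in the conformer the fusion happens, how dimensions are matched, or how many attention heads are used. The function names, the single projection matrix `proj`, the residual addition, and the single-head attention are therefore all hypothetical choices. SSL features are treated as inputs that have already been computed, in line with the abstract's statement that the SSL model is not used during training.

```python
import math


def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m); both are lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def framewise_addition(acoustic, ssl, proj):
    """Add a projected SSL vector to the acoustic vector of the same frame.

    Assumption: both sequences have the same number of frames T.
    acoustic: T x d, ssl: T x d_ssl, proj: d_ssl x d (hypothetical projection).
    """
    if len(acoustic) != len(ssl):
        raise ValueError("framewise addition needs equal frame counts")
    projected = matmul(ssl, proj)  # T x d
    return [
        [a + p for a, p in zip(a_row, p_row)]
        for a_row, p_row in zip(acoustic, projected)
    ]


def cross_attention(acoustic, ssl, proj):
    """Let each acoustic frame attend over all SSL frames (single head).

    Frame counts may differ here: acoustic is T x d, ssl is S x d_ssl.
    Assumption: the projected SSL features serve as both keys and values,
    and the attended context is added back to the acoustic features.
    """
    d = len(acoustic[0])
    keys = matmul(ssl, proj)  # S x d
    keys_t = [list(col) for col in zip(*keys)]  # d x S
    scores = matmul(acoustic, keys_t)  # T x S
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    context = matmul(weights, keys)  # T x d
    return [
        [a + c for a, c in zip(a_row, c_row)]
        for a_row, c_row in zip(acoustic, context)
    ]
```

The practical difference this sketch highlights is alignment: framewise addition needs one SSL vector per acoustic frame, whereas cross-attention tolerates different frame counts because each acoustic frame takes a weighted average over all SSL frames.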

