Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition
This work improves speech recognition accuracy for applications such as transcription, though the contribution is incremental: it combines existing methods rather than introducing a new one.
The paper improves automatic speech recognition by combining semi-supervised learning techniques, achieving word-error-rates of 1.4%/2.6% on the LibriSpeech test/test-other sets, compared with the previous state-of-the-art of 1.7%/3.3%.
We employ a combination of recent developments in semi-supervised learning for automatic speech recognition to obtain state-of-the-art results on LibriSpeech utilizing the unlabeled audio of the Libri-Light dataset. More precisely, we carry out noisy student training with SpecAugment using giant Conformer models pre-trained with wav2vec 2.0. By doing so, we are able to achieve word-error-rates (WERs) of 1.4%/2.6% on the LibriSpeech test/test-other sets, against the current state-of-the-art WERs of 1.7%/3.3%.
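
To make the reported numbers concrete, the following is a minimal sketch of how a word-error-rate can be computed and how the gap to the prior state-of-the-art translates into a relative reduction. It assumes the standard definition of WER as word-level edit distance divided by the number of reference words. The function names `word_error_rate` and `relative_reduction` are hypothetical and chosen for illustration; this is not the paper's evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER for one utterance: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # WERs from the abstract, in percent: test and test-other.
    print(round(100 * relative_reduction(1.7, 1.4), 1))
    print(round(100 * relative_reduction(3.3, 2.6), 1))
```

Under these assumptions, the reported WERs correspond to relative reductions of roughly 17.6% on test and 21.2% on test-other. Note that a corpus-level WER is usually computed by summing errors and reference words over all utterances before dividing, rather than by averaging per-utterance values.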