CL, LG | Nov 10, 2019

Effectiveness of self-supervised pre-training for speech recognition

Alexei Baevski, Michael Auli, Abdelrahman Mohamed

arXiv:1911.03912v3 | 10.8 | 158 citations

Originality: Incremental advance

AI Analysis

This work addresses the challenge of reducing labeled data requirements for speech recognition systems, showing significant improvements with minimal supervision, though it is incremental in refining existing self-supervised techniques.

The paper addresses speech recognition with limited labeled data by comparing self-supervised pre-training methods. It finds that quantization-based approaches, in which vq-wav2vec builds a discrete vocabulary and a BERT model is then fine-tuned directly with a CTC loss, work best. With only 10 hours of labeled Librispeech data, this setup nearly matches the best reported system trained on 100 hours on test-clean and achieves a 25% WER reduction on test-other.
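For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words, and a "25% WER reduction" is a relative change between two WER values. The sketch below illustrates both calculations. The function names, the example sentences, and the numbers 20.0 and 15.0 are illustrative assumptions, not values or code from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical example: one deleted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")

# Hypothetical WER values, chosen only to illustrate a 25% relative reduction.
reduction = relative_reduction(20.0, 15.0)
```

Note that a relative reduction is distinct from an absolute one: going from a hypothetical 20.0 to 15.0 WER is a 5-point absolute drop but a 25% relative reduction.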

We compare self-supervised representation learning algorithms which either explicitly quantize the audio data or learn representations without quantization. We find the former to be more accurate since it builds a good vocabulary of the data through vq-wav2vec [1] to enable learning of effective representations in subsequent BERT training. In contrast to previous work, we directly fine-tune the pre-trained BERT models on transcribed speech using a Connectionist Temporal Classification (CTC) loss instead of feeding the representations into a task-specific model. We also propose a BERT-style model learning directly from the continuous audio data and compare pre-training on raw audio to spectral features. Fine-tuning a BERT model on 10 hours of labeled Librispeech data with a vq-wav2vec vocabulary is almost as good as the best known reported system trained on 100 hours of labeled data on test-clean, while achieving a 25% WER reduction on test-other. When using only 10 minutes of labeled data, WER is 25.2 on test-other and 16.3 on test-clean. This demonstrates that self-supervision can enable speech recognition systems trained on a near-zero amount of transcribed data.
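The CTC loss mentioned in the abstract trains a model to emit one label per audio frame, using a special blank symbol so that the frame-level sequence can be longer than the transcript. A minimal sketch of the standard greedy CTC collapse rule (merge consecutive repeats, then drop blanks) is shown below. The blank index of 0, the function name, and the example frame labels are assumptions for illustration; this is not the authors' decoding setup.

```python
from itertools import groupby
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_collapse(frame_ids: Sequence[int], blank: int = BLANK) -> List[int]:
    """Collapse per-frame label ids into an output label sequence.

    Step 1: merge runs of identical consecutive labels into one label.
    Step 2: remove blank labels.
    A blank between two identical labels keeps them as separate outputs.
    """
    merged = [label for label, _ in groupby(frame_ids)]
    return [label for label in merged if label != blank]


# Hypothetical per-frame argmax labels for eight frames.
frames = [0, 3, 3, 0, 3, 5, 5, 0]
decoded = ctc_greedy_collapse(frames)  # the two 3s survive because a blank separates them
```

In practice the per-frame labels would come from the argmax over the fine-tuned model's output distribution at each time step, and a language model may be used during decoding instead of this greedy rule.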
