Wav2Vec-Aug: Improved self-supervised training with limited data
The work addresses the shortage of unlabeled data that limits speech SSL for many languages, though it may appear incremental because it builds on Wav2Vec 2.0.
The paper applies self-supervised learning to speech domains with limited data by using data augmentation during Wav2Vec 2.0 pretraining, achieving up to a 13% relative improvement in word error rate over Wav2Vec 2.0 on Librispeech test-clean / other.
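To make the "relative improvement" figure concrete, here is a minimal sketch of how a relative WER improvement is usually computed. The function name and the WER values are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    """Return the fractional WER reduction relative to the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers: a baseline at 10.0% WER and a new model at 8.7% WER.
improvement = relative_wer_improvement(10.0, 8.7)
print(f"relative improvement: {improvement:.1%}")
```

Note that a relative improvement is measured against the baseline WER, so the same absolute gain counts for more when the baseline WER is already low.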
Self-supervised learning (SSL) of speech representations has received much attention over the last few years, but most work has focused on languages and domains with an abundance of unlabeled data. For many languages, however, even unlabeled data is scarce, which limits the effectiveness of SSL. In this work, we focus on the problem of applying SSL to domains with limited available data by leveraging data augmentation for Wav2Vec 2.0 pretraining. Further, we propose improvements to each component of the model, which result in a combined relative word error rate (WER) improvement of up to 13% compared to Wav2Vec 2.0 on Librispeech test-clean / other.
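The abstract does not say which augmentations are used, so the following is only a generic sketch of one common waveform augmentation: additive Gaussian noise at a target signal-to-noise ratio. The function `add_noise_at_snr`, the 16 kHz sample rate, and the 10 dB SNR are assumptions for illustration, not the authors' recipe.

```python
import numpy as np


def add_noise_at_snr(waveform: np.ndarray, snr_db: float,
                     rng: np.random.Generator) -> np.ndarray:
    """Add white Gaussian noise so the result has roughly the given SNR (dB)."""
    signal_power = np.mean(waveform ** 2)
    # SNR(dB) = 10 * log10(signal_power / noise_power)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=waveform.shape)
    return waveform + noise


# Usage: augment one second of a synthetic 440 Hz tone sampled at 16 kHz.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = np.sin(2.0 * np.pi * 440.0 * t)
noisy = add_noise_at_snr(clean, snr_db=10.0, rng=rng)
```

In a pretraining pipeline, a transform like this would typically be applied on the fly to each training utterance, so the model sees a different distorted copy of the same limited audio on every pass.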