CL · SD · AS · Feb 3, 2022

Self-supervised Learning with Random-projection Quantizer for Speech Recognition

arXiv:2202.01855v2 · 248 citations
AI Analysis

This work targets speech recognition in settings that require streaming and multilingual support. The approach is flexible and effective, though it appears incremental because it builds on existing self-supervised learning paradigms.

The paper proposes a self-supervised learning method in which a model predicts masked speech signals as discrete labels produced by a random-projection quantizer. On LibriSpeech, it achieves word error rates similar to prior self-supervised work with non-streaming models, and lower word error rates and latency than wav2vec 2.0 and w2v-BERT with streaming models. It also reports significant improvements over both baselines on multilingual tasks.

We present a simple and effective self-supervised learning approach for speech recognition. The approach learns a model to predict the masked speech signals, in the form of discrete labels generated with a random-projection quantizer. In particular, the quantizer projects speech inputs with a randomly initialized matrix and does a nearest-neighbor lookup in a randomly initialized codebook. Neither the matrix nor the codebook is updated during self-supervised learning. Since the random-projection quantizer is not trained and is separated from the speech recognition model, the design makes the approach flexible and compatible with universal speech recognition architectures. On LibriSpeech, our approach achieves word-error-rates similar to previous work using self-supervised learning with non-streaming models, and provides lower word-error-rates and latency than wav2vec 2.0 and w2v-BERT with streaming models. On multilingual tasks, the approach also provides significant improvement over wav2vec 2.0 and w2v-BERT.
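The quantizer described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the dimensions (80-dimensional input frames, 16-dimensional codes, 64 codebook entries), the Gaussian initialization, the l2 normalization before the lookup, and the independent per-frame masking with probability 0.15 are all assumptions chosen for illustration. The sketch shows only how frozen random parameters turn speech frames into discrete labels that a separate model could be trained to predict at masked positions.

```python
import numpy as np


def make_quantizer(input_dim, code_dim, codebook_size, seed=0):
    """Create a frozen random projection matrix and a frozen random codebook.

    Gaussian initialization is an assumption of this sketch.
    """
    rng = np.random.default_rng(seed)
    projection = rng.standard_normal((input_dim, code_dim))
    codebook = rng.standard_normal((codebook_size, code_dim))
    return projection, codebook


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def quantize(features, projection, codebook):
    """Map each frame to the index of its nearest codebook entry.

    features: (num_frames, input_dim) array of speech features.
    Returns an integer array of shape (num_frames,).
    The l2 normalization is an assumption of this sketch.
    """
    projected = l2_normalize(features @ projection)   # (T, code_dim)
    codes = l2_normalize(codebook)                    # (V, code_dim)
    # Squared Euclidean distance between every frame and every code.
    diffs = projected[:, None, :] - codes[None, :, :]  # (T, V, code_dim)
    distances = (diffs ** 2).sum(axis=-1)              # (T, V)
    return distances.argmin(axis=-1)


# Hypothetical sizes: 100 frames of 80-dimensional features.
projection, codebook = make_quantizer(input_dim=80, code_dim=16, codebook_size=64)
rng = np.random.default_rng(1)
frames = rng.standard_normal((100, 80))

# Labels come from the unmasked input; the quantizer is never updated.
labels = quantize(frames, projection, codebook)

# Hypothetical masking: each frame is masked independently.
mask = rng.random(frames.shape[0]) < 0.15
masked_targets = labels[mask]  # labels a model would predict at masked frames
```

Because `projection` and `codebook` are never trained, the labels depend only on the input frames and the random seed, which keeps the quantizer independent of whatever recognition model consumes the targets.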

Code Implementations: 4 repos