CL LG SD AS
Sep 14, 2021

Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition

Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi

arXiv:2109.06870v1 · 6.0 · 89 citations · Has Code

Originality: Incremental advance

AI Analysis

This work improves both the efficiency and the accuracy of pre-trained models for automatic speech recognition (ASR), an incremental advance over wav2vec 2.0.

The paper studies the trade-off between performance and efficiency in unsupervised pre-training for speech recognition and introduces SEW (Squeezed and Efficient Wav2vec). Under the 100h-960h semi-supervised setup on LibriSpeech, SEW achieves a 1.9x inference speedup and a 13.5% relative reduction in word error rate compared to wav2vec 2.0.

This paper is a study of performance-efficiency trade-offs in pre-trained models for automatic speech recognition (ASR). We focus on wav2vec 2.0, and formalize several architecture designs that influence both the model performance and its efficiency. Putting together all our observations, we introduce SEW (Squeezed and Efficient Wav2vec), a pre-trained model architecture with significant improvements along both performance and efficiency dimensions across a variety of training setups. For example, under the 100h-960h semi-supervised setup on LibriSpeech, SEW achieves a 1.9x inference speedup compared to wav2vec 2.0, with a 13.5% relative reduction in word error rate. With a similar inference time, SEW reduces word error rate by 25-50% across different model sizes.
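The two headline metrics in the abstract, relative word error rate (WER) reduction and inference speedup, are simple ratios against a baseline. The following is a minimal sketch of how such figures are typically computed. The function names and all input numbers are hypothetical and chosen only for illustration; they are not measurements from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the fraction by which new_wer is lower than baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


def inference_speedup(baseline_seconds: float, new_seconds: float) -> float:
    """Return how many times faster the new model is than the baseline."""
    if new_seconds <= 0:
        raise ValueError("new_seconds must be positive")
    return baseline_seconds / new_seconds


if __name__ == "__main__":
    # Hypothetical values for illustration only, NOT results from the paper.
    baseline_wer, new_wer = 10.0, 8.65      # WER in percent
    baseline_time, new_time = 1.9, 1.0      # inference time in seconds

    reduction = relative_wer_reduction(baseline_wer, new_wer)
    speedup = inference_speedup(baseline_time, new_time)
    print(f"relative WER reduction: {reduction:.1%}")
    print(f"inference speedup: {speedup:.1f}x")
```

Note that a relative reduction is measured against the baseline's WER, so it differs from the absolute difference in percentage points between the two models.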
