CL · Mar 29, 2021

Shrinking Bigfoot: Reducing wav2vec 2.0 footprint

arXiv:2103.15760v2667 citations
AI Analysis

This work addresses the bottleneck of deploying large speech recognition models in production, making them more cost-effective and environmentally friendly. The contribution is incremental, since it applies existing compression techniques to a new model.

The paper tackles the high inference latency and large size of wav2vec 2.0 for speech recognition by applying model compression methods. The resulting student model is 2 times faster and 4.8 times smaller with only a 7% degradation in word error rate (WER), and the quantized model is 3.6 times smaller with only a 0.1% degradation in WER.
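As a rough illustration of why quantization shrinks a model, the sketch below applies symmetric 8-bit quantization to a handful of weights. This is a generic scheme chosen for illustration and is not necessarily the one used in the paper; the function names and the sample weights are hypothetical. Storing each weight in 1 byte instead of 4 gives at most a 4x reduction, so a reported 3.6x would be consistent with part of the model staying in higher precision, though that reading is an assumption.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127].

    Hypothetical helper for illustration; real toolkits may use
    per-channel scales, zero points, or calibration data.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the stored integers back to approximate float weights."""
    return [v * scale for v in q]


# Made-up weights standing in for one small tensor of a model.
weights = [0.82, -1.27, 0.05, 0.0, 0.635]

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding to the nearest integer bounds the error by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))

# Storage cost: 4 bytes per float32 weight versus 1 byte per int8 weight.
fp32_bytes = 4 * len(weights)
int8_bytes = 1 * len(weights)
size_ratio = fp32_bytes / int8_bytes

print("quantized:", q)
print("scale:", scale)
print("max reconstruction error:", max_error)
print("size ratio:", size_ratio)
```

The small reconstruction error per weight is what makes a near-lossless result plausible, but the effect on WER can only be established empirically.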

Wav2vec 2.0 is a state-of-the-art speech recognition model which maps speech audio waveforms into latent representations. The largest version of wav2vec 2.0 contains 317 million parameters. Hence, the inference latency of wav2vec 2.0 will be a bottleneck in production, leading to high costs and a significant environmental footprint. To improve wav2vec's applicability to a production setting, we explore multiple model compression methods borrowed from the domain of large language models. Using a teacher-student approach, we distilled the knowledge from the original wav2vec 2.0 model into a student model, which is 2 times faster and 4.8 times smaller than the original model. This increase in performance is accomplished with only a 7% degradation in word error rate (WER). Our quantized model is 3.6 times smaller than the original model, with only a 0.1% degradation in WER. To the best of our knowledge, this is the first work that compresses wav2vec 2.0.
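The teacher-student idea mentioned in the abstract can be sketched with a generic distillation loss: the student is trained to match the teacher's temperature-softened output distribution. The code below is a minimal sketch under stated assumptions, not the paper's training objective. The function names, the temperature value, and the sample logits are all hypothetical, and a real setup would typically combine this term with a task loss.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    The T^2 factor is the common convention for keeping gradient
    magnitudes comparable across temperatures (an assumption here,
    not something the source specifies).
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Made-up logits for a single frame over a three-symbol vocabulary.
teacher_logits = [2.0, 0.5, -1.0]
student_logits = [1.5, 0.7, -0.8]

loss_matched = distillation_loss(teacher_logits, teacher_logits)
loss_mismatched = distillation_loss(teacher_logits, student_logits)

print("loss when student equals teacher:", loss_matched)
print("loss when student differs:", loss_mismatched)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, which is the signal a smaller student model would be trained to minimize.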
