AS CL LG SD | Mar 29, 2022

LightHuBERT: Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT

Rui Wang, Qibing Bai, Junyi Ao, Long Zhou, Zhixiang Xiong, Zhihua Wei, Yu Zhang, Tom Ko, Haizhou Li

arXiv:2203.15610v2 | 20.873 citations | h-index: 74 | Has Code

Originality: Incremental advance

AI Analysis

Large pre-trained speech models are difficult to deploy in low-resource settings. This work offers a lightweight, configurable alternative, an incremental advance in model compression for speech processing.

The paper targets the storage cost of pre-trained speech models such as HuBERT. It proposes LightHuBERT, a once-for-all Transformer compression framework that finds architectures automatically by pruning structured parameters, with a search space of over $10^9$ architectures. At the same size as HuBERT, LightHuBERT outperforms it on ASR and five SUPERB tasks. With 29% fewer parameters, it performs comparably to the teacher model on most tasks. On three SUPERB tasks (automatic speaker verification, keyword spotting, and intent classification), it reaches a $3.5\times$ compression ratio with a slight accuracy loss.

Self-supervised speech representation learning has shown promising results in various speech processing tasks. However, the pre-trained models, e.g., HuBERT, are storage-intensive Transformers, limiting their scope of applications under low-resource settings. To this end, we propose LightHuBERT, a once-for-all Transformer compression framework, to find the desired architectures automatically by pruning structured parameters. More precisely, we create a Transformer-based supernet that is nested with thousands of weight-sharing subnets and design a two-stage distillation strategy to leverage the contextualized latent representations from HuBERT. Experiments on automatic speech recognition (ASR) and the SUPERB benchmark show the proposed LightHuBERT enables over $10^9$ architectures concerning the embedding dimension, attention dimension, head number, feed-forward network ratio, and network depth. LightHuBERT outperforms the original HuBERT on ASR and five SUPERB tasks with the HuBERT size, achieves comparable performance to the teacher model in most tasks with a reduction of 29% parameters, and obtains a $3.5\times$ compression ratio in three SUPERB tasks, e.g., automatic speaker verification, keyword spotting, and intent classification, with a slight accuracy loss. The code and pre-trained models are available at https://github.com/mechanicalsea/lighthubert.
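The supernet idea in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the choice lists in GLOBAL_SPACE and LAYER_SPACE are hypothetical placeholders rather than the paper's search space, the per-layer choices are treated as independent, and slice_weight shows one common weight-sharing convention (a subnet reuses the leading block of the supernet matrix) that may differ from the authors' implementation.

```python
import random
from math import prod

# Hypothetical search space: illustrative placeholders, NOT the paper's values.
GLOBAL_SPACE = {
    "embed_dim": [512, 640, 768],
    "num_layers": [10, 11, 12],
}
LAYER_SPACE = {
    "attn_dim": [512, 640, 768],
    "num_heads": [8, 10, 12],
    "ffn_ratio": [3.0, 3.5, 4.0],
}


def count_architectures(global_space, layer_space):
    """Count distinct subnets, assuming per-layer choices are independent."""
    per_layer = prod(len(choices) for choices in layer_space.values())
    embed_choices = len(global_space["embed_dim"])
    # Each depth d contributes per_layer ** d layer configurations.
    return embed_choices * sum(per_layer ** d for d in global_space["num_layers"])


def sample_subnet(global_space, layer_space, rng):
    """Draw one subnet configuration uniformly at random from the space."""
    depth = rng.choice(global_space["num_layers"])
    return {
        "embed_dim": rng.choice(global_space["embed_dim"]),
        "layers": [
            {name: rng.choice(choices) for name, choices in layer_space.items()}
            for _ in range(depth)
        ],
    }


def slice_weight(weight, out_dim, in_dim):
    """Assumed weight sharing: keep the first out_dim rows and in_dim columns."""
    return [row[:in_dim] for row in weight[:out_dim]]


if __name__ == "__main__":
    rng = random.Random(0)
    subnet = sample_subnet(GLOBAL_SPACE, LAYER_SPACE, rng)
    print("sampled depth:", len(subnet["layers"]))
    print("search space size:", count_architectures(GLOBAL_SPACE, LAYER_SPACE))
```

Because every subnet reads its parameters from the same supernet matrices, no subnet needs to be stored or trained separately. On the reported numbers: if the compression ratio is read as original size over compressed size, a $3.5\times$ ratio keeps about 1/3.5 ≈ 29% of the parameters, whereas a 29% parameter reduction keeps about 71%.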
