Application of Knowledge Distillation to Multi-task Speech Representation Learning
This work addresses model size as an obstacle to edge AI deployment in speech processing. It is an incremental improvement that adapts existing knowledge distillation techniques to multi-task scenarios.
The paper tackles the challenge of deploying large speech representation learning models on edge AI devices by applying knowledge distillation followed by joint fine-tuning on multiple voice-activated tasks. It achieves a nearly 75% reduction in model size with only minor degradations in accuracy (0.1%) and equal error rate (0.9%).
Model architectures such as wav2vec 2.0 and HuBERT have been proposed to learn speech representations from audio waveforms in a self-supervised manner. When combined with downstream tasks such as keyword spotting and speaker verification, they provide state-of-the-art performance. However, these models are large: even the smallest version has 95 million parameters, which makes deployment on edge AI devices challenging. In this paper, we investigate the application of knowledge distillation to speech representation learning (SRL) models, followed by joint fine-tuning with multiple downstream voice-activated tasks. In our experiments on two such tasks, our approach yields a nearly 75% reduction in model size while suffering only 0.1% accuracy and 0.9% equal error rate degradation compared to the full-size model. In addition, we show that fine-tuning the SRL models gives a significant performance boost compared to using frozen SRL models.
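The two-stage recipe described in the abstract (distill first, then fine-tune jointly on several tasks) can be illustrated with a minimal sketch of the loss terms involved. This is not the paper's implementation. The abstract does not specify the distillation objective or how the task losses are combined, so the mean-squared-error feature matching, the weighted-sum joint loss, the task names `kws` and `sv`, and all function names below are assumptions chosen for illustration.

```python
import math


def distillation_loss(student_feats, teacher_feats):
    """Stage 1 (assumed objective): mean squared error between the
    student's and the teacher's per-frame feature vectors."""
    total, count = 0.0, 0
    for s_frame, t_frame in zip(student_feats, teacher_feats):
        for s, t in zip(s_frame, t_frame):
            total += (s - t) ** 2
            count += 1
    return total / count


def cross_entropy(logits, label):
    """Numerically stable cross-entropy for one example,
    usable as a per-task classification loss."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]


def joint_loss(task_losses, weights):
    """Stage 2 (assumed combination): weighted sum of per-task losses,
    so that gradients from every task update the shared student model."""
    return sum(weights[name] * loss for name, loss in task_losses.items())


# Stage 1: the student mimics the teacher's features (toy values).
kd = distillation_loss([[0.0, 0.0]], [[1.0, 3.0]])

# Stage 2: hypothetical keyword-spotting and speaker-verification heads
# share the distilled student; their losses are combined with equal weights.
losses = {
    "kws": cross_entropy([2.0, 0.5, 0.1], 0),
    "sv": cross_entropy([0.3, 1.2], 1),
}
total = joint_loss(losses, {"kws": 0.5, "sv": 0.5})
```

In a real training loop these quantities would be computed on batched tensors in an autograd framework, and the student would not be frozen during the second stage, in line with the abstract's observation that fine-tuning the SRL model outperforms keeping it frozen.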