DiceHuBERT: Distilling HuBERT with a Self-Supervised Learning Objective
This work addresses the need for efficient, high-performance speech models for applications such as automatic speech recognition (ASR), though the contribution is incremental because it builds on existing distillation and HuBERT methods.
The paper tackles the problem of compressing HuBERT, a self-supervised learning (SSL) speech model, by introducing DiceHuBERT, a knowledge distillation framework that trains the student with the same SSL objective as HuBERT, eliminating the need for additional modules. On the SUPERB benchmark, DiceHuBERT improves phoneme recognition by over 21% and ASR performance by more than 14% over existing distillation methods.
We introduce DiceHuBERT, a knowledge distillation framework for compressing HuBERT, a widely used self-supervised learning (SSL)-based speech foundation model. Unlike existing distillation methods that rely on layer-wise and feature-wise mapping between teacher and student models, DiceHuBERT leverages HuBERT's iterative self-distillation mechanism by directly replacing the original model with a student model. This replacement allows the student to be trained using the same SSL objective used when pre-training HuBERT, eliminating the need for additional modules or architectural constraints. Experimental results on SUPERB show that DiceHuBERT consistently outperforms existing distillation methods, improving phoneme recognition performance by over 21% and ASR performance by more than 14%. Furthermore, DiceHuBERT demonstrates competitive performance across multiple tasks, highlighting its clear advantage.
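The training signal described in the abstract can be illustrated with a minimal sketch, not the authors' implementation. It assumes the standard HuBERT recipe: frame-level targets are cluster indices obtained by assigning teacher features to k-means centroids, and the student is trained with cross-entropy on masked frames only. The abstract does not spell out these details, so treat them as assumptions. The function names `nearest_centroid_labels` and `masked_prediction_loss`, the toy features, and the centroids are all hypothetical.

```python
import math

def nearest_centroid_labels(teacher_features, centroids):
    """Assign each frame to its nearest centroid (squared Euclidean distance).

    Assumption: the centroids come from k-means fitted on teacher features,
    as in HuBERT's iterative pre-training recipe.
    """
    labels = []
    for frame in teacher_features:
        dists = [
            sum((f - c) ** 2 for f, c in zip(frame, centroid))
            for centroid in centroids
        ]
        labels.append(dists.index(min(dists)))
    return labels

def masked_prediction_loss(student_logits, labels, mask):
    """Mean cross-entropy between student logits and cluster labels,
    computed over masked frames only."""
    total, count = 0.0, 0
    for logits, label, is_masked in zip(student_logits, labels, mask):
        if not is_masked:
            continue
        # Numerically stable log-sum-exp for the softmax normalizer.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[label]
        count += 1
    return total / count if count else 0.0

# Toy example: 3 frames, 2-dimensional teacher features, 2 clusters.
teacher_features = [[0.0, 0.1], [0.9, 1.0], [0.1, 0.0]]
centroids = [[0.0, 0.0], [1.0, 1.0]]  # hypothetical k-means centroids
labels = nearest_centroid_labels(teacher_features, centroids)

mask = [True, False, True]  # frames hidden from the student
student_logits = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]  # hypothetical student outputs
loss = masked_prediction_loss(student_logits, labels, mask)
```

Under these assumptions the student needs no layer-wise or feature-wise mapping to the teacher: the teacher contributes only the discrete targets, so the student architecture can be chosen freely.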