DQ-Whisper: Joint Distillation and Quantization for Efficient Multilingual Speech Recognition
This work addresses the efficiency and multilingual capability of speech recognition models, but its contribution is incremental because it builds on existing Whisper models.
The authors address the curse of multilinguality in small Whisper models by proposing DQ-Whisper, a joint distillation and quantization framework that achieves up to a 5.18x reduction in model size with marginal performance degradation.
Whisper, a popular multilingual and multitask pre-trained speech model, suffers from the curse of multilinguality. To enhance the multilingual capabilities of small Whisper models, we propose DQ-Whisper, a novel joint distillation and quantization framework that compresses Whisper for efficient inference. First, we propose a novel dynamic matching distillation strategy. Then, we introduce a quantization-aware distillation framework that integrates quantization with distillation. Experimental results on various multilingual datasets show that the proposed distillation approach effectively enhances the multilingual capabilities of small Whisper models without increasing computational cost. We achieve up to a 5.18x reduction in model size with marginal performance degradation. In addition, quantization is compatible with distillation, which enables a higher compression rate.
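To make the idea of combining quantization with distillation concrete, the following is a minimal, generic sketch and not the DQ-Whisper implementation. It assumes symmetric uniform quantization with one scale per weight row, a temperature-softened KL divergence as the distillation loss, and a toy linear layer standing in for the student and teacher models. All function names, the bit width, and the temperature are hypothetical choices made for illustration; the abstract does not specify any of them.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


def fake_quantize(weights, num_bits=8):
    """Symmetric uniform quantization: snap to an integer grid, then map back to floats."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return list(weights)
    scale = max_abs / qmax
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def linear(weights, x):
    """Toy linear layer: `weights` is a list of rows, `x` is an input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def quantization_aware_kd_loss(teacher_w, student_w, x, num_bits=8, temperature=2.0):
    """Distillation loss where the student's forward pass uses quantized weights.

    Assumption: the teacher stays in full precision and each student weight
    row gets its own quantization scale.
    """
    teacher_logits = linear(teacher_w, x)
    quantized_w = [fake_quantize(row, num_bits) for row in student_w]
    student_logits = linear(quantized_w, x)
    return kd_loss(teacher_logits, student_logits, temperature)


if __name__ == "__main__":
    teacher = [[0.5, -1.0], [0.25, 0.1]]
    student = [[0.4, -0.9], [0.3, 0.0]]
    x = [1.0, 2.0]
    for bits in (8, 4, 2):
        print(bits, quantization_aware_kd_loss(teacher, student, x, num_bits=bits))
```

In a real training setup the loss would be minimized with respect to the student's full-precision weights, typically by passing gradients through the rounding step with a straight-through estimator, so that the student learns to match the teacher under the precision it will use at inference time.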