LG · AI · May 11, 2024

AdaKD: Dynamic Knowledge Distillation of ASR models using Adaptive Loss Weighting

arXiv:2405.08019v1 · 5 citations · h-index: 3
Originality: Incremental advance
AI Analysis

This work addresses a bottleneck in model compression for automatic speech recognition, offering an incremental improvement over standard knowledge distillation techniques.

The paper tackles the problem of suboptimal performance in knowledge distillation due to fixed loss weights by proposing AdaKD, which adaptively weighs task-specific and distillation losses at the instance level based on sample difficulty. Experiments show it outperforms conventional methods and existing instance-level loss functions.

Knowledge distillation, a widely used model compression technique, works by transferring knowledge from a cumbersome teacher model to a lightweight student model. The technique involves jointly optimizing the task-specific and knowledge distillation losses, with a weight assigned to each. Although these weights play a crucial role in the performance of the distillation process, current methods give equal weight to both losses, leading to suboptimal performance. In this paper, we propose Adaptive Knowledge Distillation, a novel technique inspired by curriculum learning that adaptively weighs the losses at the instance level. The technique rests on the notion that sample difficulty increases with teacher loss. Our method follows a plug-and-play paradigm and can be applied on top of any task-specific and distillation objectives. Experiments show that our method performs better than the conventional knowledge distillation method and existing instance-level loss functions.
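To make the contrast between fixed and instance-level weighting concrete, here is a minimal sketch in plain Python. It is not the paper's formulation: the abstract does not give the weighting function, so the mapping `alpha = 1 - exp(-teacher_loss)` and the function names `fixed_weight_loss`, `adaptive_weights`, and `adaptive_weight_loss` are all hypothetical. They are chosen only to show one way a per-sample weight could grow with teacher loss, which the abstract uses as the measure of sample difficulty.

```python
import math
from typing import List


def fixed_weight_loss(task_losses: List[float], kd_losses: List[float],
                      alpha: float = 0.5) -> float:
    """Conventional setup: one shared weight for every sample in the batch."""
    if len(task_losses) != len(kd_losses):
        raise ValueError("loss lists must have the same length")
    per_sample = [alpha * t + (1.0 - alpha) * k
                  for t, k in zip(task_losses, kd_losses)]
    return sum(per_sample) / len(per_sample)


def adaptive_weights(teacher_losses: List[float]) -> List[float]:
    """Hypothetical mapping (an assumption, not the paper's formula).

    Returns a task-loss weight in [0, 1) per sample that increases with the
    teacher's loss, so harder samples lean less on the teacher's outputs.
    """
    return [1.0 - math.exp(-max(loss, 0.0)) for loss in teacher_losses]


def adaptive_weight_loss(task_losses: List[float], kd_losses: List[float],
                         teacher_losses: List[float]) -> float:
    """Instance-level weighting: each sample gets its own alpha."""
    if not (len(task_losses) == len(kd_losses) == len(teacher_losses)):
        raise ValueError("loss lists must have the same length")
    alphas = adaptive_weights(teacher_losses)
    per_sample = [a * t + (1.0 - a) * k
                  for a, t, k in zip(alphas, task_losses, kd_losses)]
    return sum(per_sample) / len(per_sample)


if __name__ == "__main__":
    task = [2.0, 1.0, 0.5]      # student task-specific loss per sample
    kd = [4.0, 0.8, 0.3]        # distillation loss per sample
    teacher = [0.0, 0.7, 3.0]   # teacher loss per sample (difficulty proxy)
    print("fixed   :", fixed_weight_loss(task, kd))
    print("adaptive:", adaptive_weight_loss(task, kd, teacher))
```

Because the weighting only consumes per-sample loss values, the same pattern could sit on top of any task-specific and distillation objectives, which matches the plug-and-play framing in the abstract. The direction and shape of the mapping remain assumptions here.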

