Importance Analysis for Dynamic Control of Balancing Parameter in a Simple Knowledge Distillation Setting
This work provides incremental theoretical insights for improving model compression in deep learning, specifically for researchers and practitioners using knowledge distillation.
The paper tackles the problem of optimizing knowledge distillation by analyzing how to dynamically adjust the balancing parameter between distillation and downstream-task losses, showing mathematically that this parameter should be increased as the loss decreases in a simple setting.
Although deep learning models owe their remarkable success to deep and complex architectures, this very complexity typically comes at the expense of real-time performance. To address this issue, a variety of model compression techniques have been proposed, among which knowledge distillation (KD) stands out for its strong empirical performance. KD comprises two concurrent processes: (i) matching the outputs of a large, pre-trained teacher network and a lightweight student network, and (ii) training the student to solve its designated downstream task. The associated loss functions are termed the distillation loss and the downstream-task loss, respectively. Numerous prior studies report that KD is most effective when the influence of the distillation loss outweighs that of the downstream-task loss. This influence (or importance) is typically regulated by a balancing parameter. This paper provides a mathematical rationale showing that, in a simple KD setting, the balancing parameter should be dynamically adjusted when the loss is decreasing.
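To make the role of the balancing parameter concrete, the following is a minimal sketch of a combined KD objective in plain Python. It assumes an additive form, downstream-task loss plus a weight `lam` times the distillation loss, with the distillation loss taken as a KL divergence between temperature-softened teacher and student outputs. The function names, the temperature value, and the linear schedule in `balancing_parameter` are all hypothetical choices made for illustration; the schedule only mirrors the stated direction (a larger weight as the loss decreases) and is not the rule derived in the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def downstream_task_loss(student_logits, label):
    """Cross-entropy of the student against the ground-truth label."""
    return -math.log(softmax(student_logits)[label])


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened outputs (assumed form)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def balancing_parameter(current_loss, initial_loss, lam_min=1.0, lam_max=10.0):
    """Hypothetical schedule: the weight grows linearly as the loss shrinks."""
    ratio = min(max(current_loss / initial_loss, 0.0), 1.0)
    return lam_min + (lam_max - lam_min) * (1.0 - ratio)


def kd_loss(student_logits, teacher_logits, label, lam):
    """Assumed additive objective: task loss + lam * distillation loss."""
    task = downstream_task_loss(student_logits, label)
    distill = distillation_loss(student_logits, teacher_logits)
    return task + lam * distill
```

In a training loop, `lam` would be recomputed from the most recent loss value before each update, so the distillation term gains weight as training progresses. Other monotone schedules would serve equally well for this illustration.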