CV | Dec 1, 2021

Information Theoretic Representation Distillation

arXiv:2112.00459v3 | 26 citations | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the practical adoption barrier of knowledge distillation for researchers and practitioners by reducing computational expenses.

The paper tackles the high computational cost of state-of-the-art knowledge distillation methods by introducing two entropy-inspired losses that maximize correlation and mutual information between student and teacher representations. The method achieves competitive performance with significantly less training overhead and sets a new state of the art for binary quantization.

Despite the empirical success of knowledge distillation, current state-of-the-art methods are computationally expensive to train, which makes them difficult to adopt in practice. To address this problem, we introduce two distinct, complementary losses inspired by a cheap entropy-like estimator. These losses aim to maximise the correlation and mutual information between the student and teacher representations. Our method incurs significantly less training overhead than other approaches and achieves performance competitive with the state of the art on the knowledge distillation and cross-model transfer tasks. We further demonstrate the effectiveness of our method on a binary distillation task, where it sets a new state of the art for binary quantisation and approaches the performance of a full-precision model. Code: www.github.com/roymiles/ITRD
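
To make the idea of maximising correlation between student and teacher representations concrete, here is a minimal sketch in plain Python. It is not the authors' loss and does not use their entropy-like estimator; it is a generic per-dimension correlation penalty. The function names `standardise` and `correlation_loss` are hypothetical, and the sketch assumes the student and teacher features already have the same dimensionality. The reference implementation is in the repository linked above.

```python
import math


def standardise(features):
    """Standardise each feature dimension to zero mean and unit variance over the batch."""
    n, d = len(features), len(features[0])
    columns = []
    for j in range(d):
        col = [row[j] for row in features]
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / n
        std = math.sqrt(var) + 1e-8  # small constant avoids division by zero
        columns.append([(x - mean) / std for x in col])
    # Convert back from column-major to row-major layout.
    return [[columns[j][i] for j in range(d)] for i in range(n)]


def correlation_loss(student, teacher):
    """Illustrative loss (an assumption, not the paper's formulation).

    Penalises the squared gap between 1 and the correlation of each
    matching student/teacher feature dimension, averaged over dimensions.
    """
    n, d = len(student), len(student[0])
    zs, zt = standardise(student), standardise(teacher)
    loss = 0.0
    for j in range(d):
        corr_jj = sum(zs[i][j] * zt[i][j] for i in range(n)) / n
        loss += (1.0 - corr_jj) ** 2
    return loss / d


if __name__ == "__main__":
    # Toy batch of 4 samples with 2 feature dimensions each.
    teacher = [[1.0, 2.0], [2.0, 0.5], [3.0, 4.0], [4.0, 1.0]]
    student = [[0.9, 2.2], [2.1, 0.4], [2.8, 3.9], [4.2, 1.3]]
    print(correlation_loss(student, teacher))
```

The loss is close to zero when each student dimension is perfectly correlated with the matching teacher dimension and grows as the correlation weakens, which is the behaviour a correlation-maximising objective needs. In practice this would be written with a tensor library so that gradients flow back to the student.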

Code Implementations: 1 repo
