LG, NE | Dec 19, 2014

FitNets: Hints for Thin Deep Nets

arXiv:1412.6550v4 | 4665 citations
Originality: Incremental advance
AI Analysis

This paper addresses training efficiency and model compression for machine learning practitioners. It offers a method for building faster or better-performing networks, though it is an incremental extension of existing distillation techniques.

The paper tackles the difficulty of training deep neural networks by extending knowledge distillation so that a thinner, deeper student network learns from a teacher's intermediate representations as well as its outputs, which improves both training and final performance. For instance, on CIFAR-10, a student with almost 10.4 times fewer parameters outperforms a larger, state-of-the-art teacher.

While depth tends to improve network performance, it also makes gradient-based training more difficult since deeper networks tend to be more non-linear. The recently proposed knowledge distillation approach is aimed at obtaining small and fast-to-execute models, and it has shown that a student network could imitate the soft output of a larger teacher network or ensemble of networks. In this paper, we extend this idea to allow the training of a student that is deeper and thinner than the teacher, using not only the outputs but also the intermediate representations learned by the teacher as hints to improve the training process and final performance of the student. Because the student intermediate hidden layer will generally be smaller than the teacher's intermediate hidden layer, additional parameters are introduced to map the student hidden layer to the prediction of the teacher hidden layer. This allows one to train deeper students that can generalize better or run faster, a trade-off that is controlled by the chosen student capacity. For example, on CIFAR-10, a deep student network with almost 10.4 times fewer parameters outperforms a larger, state-of-the-art teacher network.
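The two training signals described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the regressor here is a plain linear map (`W_r`, `b_r`), the feature arrays are random stand-ins for real network activations, and the names `hint_loss`, `distillation_loss`, `tau`, and `lam` are hypothetical. Only the regressor is updated in the demo, whereas in a real hint stage the student's lower layers would be trained as well.

```python
import numpy as np

def softmax(z, tau=1.0):
    """Row-wise softmax with temperature tau (tau > 1 softens the distribution)."""
    z = z / tau
    z = z - z.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def hint_loss(teacher_hint, student_guided, W_r, b_r):
    """Half the squared L2 distance, averaged over the batch, between the
    teacher's hidden features and the regressor's prediction of them
    computed from the (smaller) student hidden features."""
    pred = student_guided @ W_r + b_r
    diff = teacher_hint - pred
    return 0.5 * np.mean(np.sum(diff ** 2, axis=1))

def distillation_loss(student_logits, teacher_logits, labels, tau=3.0, lam=0.5):
    """Cross-entropy with the true labels plus lam times the cross-entropy
    between temperature-softened teacher and student outputs."""
    p_s = softmax(student_logits)
    hard = -np.mean(np.log(p_s[np.arange(len(labels)), labels]))
    p_t_soft = softmax(teacher_logits, tau)
    p_s_soft = softmax(student_logits, tau)
    soft = -np.mean(np.sum(p_t_soft * np.log(p_s_soft), axis=1))
    return hard + lam * soft

# Random stand-ins for real activations (assumption: flattened feature vectors).
rng = np.random.default_rng(0)
batch, d_teacher, d_student, n_classes = 8, 64, 16, 10
teacher_hint = rng.normal(size=(batch, d_teacher))
student_guided = rng.normal(size=(batch, d_student))
W_r = rng.normal(scale=0.1, size=(d_student, d_teacher))
b_r = np.zeros(d_teacher)

loss_before = hint_loss(teacher_hint, student_guided, W_r, b_r)

# A few gradient-descent steps on the regressor parameters only.
lr = 0.05
for _ in range(20):
    residual = teacher_hint - (student_guided @ W_r + b_r)
    W_r += lr * (student_guided.T @ residual) / batch
    b_r += lr * residual.mean(axis=0)

loss_after = hint_loss(teacher_hint, student_guided, W_r, b_r)

student_logits = rng.normal(size=(batch, n_classes))
teacher_logits = rng.normal(size=(batch, n_classes))
labels = rng.integers(0, n_classes, size=batch)
kd = distillation_loss(student_logits, teacher_logits, labels)

print(f"hint loss: {loss_before:.3f} -> {loss_after:.3f}; distillation loss: {kd:.3f}")
```

The regressor exists only to bridge the width mismatch between student and teacher layers, so it would typically be discarded once hint training is done and the student moves on to the distillation objective.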

Code Implementations: 4 repos