CV · AI · Oct 16, 2024

Fuse Before Transfer: Knowledge Fusion for Heterogeneous Distillation

arXiv:2410.12342v2 · 1 citation · h-index: 7 · Has Code
Originality: Incremental advance
AI Analysis

The paper addresses the limited flexibility of knowledge distillation by enabling effective transfer between different architectures. The advance is incremental, since it builds on existing distillation methods.

The paper tackles the challenge of knowledge distillation between heterogeneous architectures by introducing an assistant model that fuses convolution and attention modules and using a spatial-agnostic InfoNCE loss, achieving state-of-the-art performance with gains up to 11.47% on CIFAR-100 and 3.67% on ImageNet-1K.

Most knowledge distillation (KD) methodologies predominantly focus on teacher-student pairs with similar architectures, such as both being convolutional neural networks (CNNs). However, the potential and flexibility of KD can be greatly improved by expanding it to novel Cross-Architecture KD (CAKD), where the knowledge of homogeneous and heterogeneous teachers can be transferred flexibly to a given student. The primary challenge in CAKD lies in the substantial feature gaps between heterogeneous models, originating from the distinction of their inherent inductive biases and module functions. To this end, we introduce an assistant model as a bridge to facilitate smooth feature knowledge transfer between heterogeneous teachers and students. More importantly, within our proposed design principle, the assistant model combines the advantages of cross-architecture inductive biases and module functions by merging convolution and attention modules derived from both student and teacher module functions. Furthermore, we observe that heterogeneous features exhibit diverse spatial distributions in CAKD, hindering the effectiveness of conventional pixel-wise mean squared error (MSE) loss. Therefore, we leverage a spatial-agnostic InfoNCE loss to align features after spatial smoothing, thereby improving the feature alignments in CAKD. Our proposed method is evaluated across some homogeneous model pairs and arbitrary heterogeneous combinations of CNNs, ViTs, and MLPs, achieving state-of-the-art performance for distilled models with a maximum gain of 11.47% on CIFAR-100 and 3.67% on ImageNet-1K. Our code and models will be released.
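The spatial-agnostic alignment described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `spatial_smooth` and `info_nce`, the use of global average pooling as the smoothing step, the temperature value, and the assumption that student and teacher features already share a channel dimension are all hypothetical choices made for illustration. The sketch pools each feature map over its spatial positions, then applies a standard InfoNCE objective in which the teacher feature of the same image is the positive and the teacher features of other images in the batch are negatives.

```python
import numpy as np


def spatial_smooth(feat):
    """Assumed smoothing step: average over H and W, (B, C, H, W) -> (B, C)."""
    return feat.mean(axis=(2, 3))


def info_nce(student_feat, teacher_feat, temperature=0.1):
    """InfoNCE between pooled student and teacher features.

    Positives are student/teacher pairs from the same image (the diagonal
    of the similarity matrix); all other pairs in the batch are negatives.
    """
    s = spatial_smooth(student_feat)
    t = spatial_smooth(teacher_feat)

    # L2-normalise so the dot product is a cosine similarity.
    s = s / (np.linalg.norm(s, axis=1, keepdims=True) + 1e-8)
    t = t / (np.linalg.norm(t, axis=1, keepdims=True) + 1e-8)

    logits = s @ t.T / temperature  # shape (B, B)

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Cross-entropy with the matching index as the target class.
    return float(-np.mean(np.diag(log_prob)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher = rng.normal(size=(4, 8, 7, 7))
    student = rng.normal(size=(4, 8, 7, 7))
    print("loss:", info_nce(student, teacher))
```

Because the pooled features discard spatial position, flipping or shuffling the pixels of a feature map leaves this loss unchanged, whereas a pixel-wise MSE would penalise any spatial mismatch between the two architectures.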

