LG AI · Sep 27, 2025

Knowledge distillation through geometry-aware representational alignment

arXiv:2509.25253v1 · h-index: 12
Originality: Incremental advance
AI Analysis

This work aims to improve knowledge distillation for model compression. It is an incremental advance: it builds on existing feature distillation methods by integrating geometry-aware losses.

The paper tackles feature-based knowledge distillation. It shows that existing methods fail to capture the teacher's feature structure even at zero loss, and it proposes the Procrustes distance and the Frobenius norm of the Feature Gram Matrix as distillation losses. The reported result is a statistically significant improvement of up to 2 percentage points on classification and instruction-following tasks for BERT and OPT models.

Knowledge distillation is a common paradigm for transferring capabilities from larger models to smaller ones. While traditional distillation methods leverage a probabilistic divergence over the outputs of the teacher and student models, feature-based distillation methods often minimize variants of Euclidean norms between the hidden-layer representations. The main goal is for the student to mimic the structure of the teacher's feature space. In this work, we theoretically show that existing feature distillation methods, such as projection-based mean squared loss or Centered Kernel Alignment (CKA), cannot capture the feature structure, even under zero loss. We then motivate the use of the Procrustes distance and the Frobenius norm of the Feature Gram Matrix, distances already common in the context of measuring representational alignment, as distillation losses. We show that feature distillation through our method yields a statistically significant improvement in distillation performance across language model families (BERT and OPT) in classification and instruction-following tasks, by up to 2 percentage points, highlighting the potential of integrating feature geometry into existing distillation methods.
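The two distances named in the abstract can be illustrated with a minimal NumPy sketch. It assumes the textbook definitions (orthogonal Procrustes distance between two feature matrices of equal shape, and the Frobenius norm of the difference between their n x n Gram matrices); the abstract does not state the paper's exact normalization, centering, or handling of mismatched hidden sizes, so this should not be read as the authors' implementation. The function names `gram_frobenius_loss` and `procrustes_distance` are hypothetical.

```python
import numpy as np


def gram_frobenius_loss(student, teacher):
    """Frobenius norm of the difference between the n x n feature Gram matrices.

    Rows are samples, columns are hidden features. The Gram matrices are
    n x n, so the two models may have different hidden sizes.
    """
    gram_student = student @ student.T
    gram_teacher = teacher @ teacher.T
    return np.linalg.norm(gram_student - gram_teacher, ord="fro")


def procrustes_distance(student, teacher):
    """Orthogonal Procrustes distance: min over orthogonal Q of ||student @ Q - teacher||_F.

    Assumption: both matrices have the same shape. The closed form uses the
    nuclear norm (sum of singular values) of student.T @ teacher.
    """
    if student.shape != teacher.shape:
        raise ValueError("this sketch assumes equal shapes")
    singular_values = np.linalg.svd(student.T @ teacher, compute_uv=False)
    squared = (
        np.linalg.norm(student) ** 2
        + np.linalg.norm(teacher) ** 2
        - 2.0 * singular_values.sum()
    )
    # Clamp tiny negative values caused by floating-point round-off.
    return np.sqrt(max(squared, 0.0))


rng = np.random.default_rng(0)
teacher = rng.normal(size=(8, 4))               # 8 samples, 4 features
q, _ = np.linalg.qr(rng.normal(size=(4, 4)))    # random orthogonal matrix
rotated = teacher @ q                           # same geometry, rotated basis
unrelated = rng.normal(size=(8, 4))             # independent features

print("plain MSE, rotated vs teacher:", np.mean((rotated - teacher) ** 2))
print("Procrustes, rotated vs teacher:", procrustes_distance(rotated, teacher))
print("Gram Frobenius, rotated vs teacher:", gram_frobenius_loss(rotated, teacher))
print("Procrustes, unrelated vs teacher:", procrustes_distance(unrelated, teacher))
```

Both distances are invariant to an orthogonal change of basis, so a student whose features are a rotation of the teacher's scores near zero, while an unprojected elementwise mean squared error does not. In a training loop the same formulas would be written in an autodiff framework so that gradients flow to the student.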

