CV · LG · Mar 16

Self-Distillation of Hidden Layers for Self-Supervised Representation Learning

arXiv:2603.15553 · 71.4 · h-index: 6
Predicted impact: top 41% in CV · last 90 days
Originality: Incremental advance
AI Analysis

This work addresses training instability and computational inefficiency in self-supervised learning for computer vision, offering a novel approach that is incremental but impactful for domain-specific applications.

The paper bridges generative and predictive self-supervised learning by introducing Bootleg, which predicts latent representations from multiple hidden layers of a teacher network. It reports significant gains, including +10% over I-JEPA on ImageNet-1K classification, along with improvements on semantic segmentation tasks.

The landscape of self-supervised learning (SSL) is currently dominated by generative approaches (e.g., MAE) that reconstruct raw low-level data, and predictive approaches (e.g., I-JEPA) that predict high-level abstract embeddings. While generative methods provide strong grounding, they are computationally inefficient for high-redundancy modalities like imagery, and their training objective does not prioritize learning high-level, conceptual features. Conversely, predictive methods often suffer from training instability due to their reliance on the non-stationary targets of final-layer self-distillation. We introduce Bootleg, a method that bridges this divide by tasking the model with predicting latent representations from multiple hidden layers of a teacher network. This hierarchical objective forces the model to capture features at varying levels of abstraction simultaneously. We demonstrate that Bootleg significantly outperforms comparable baselines (+10% over I-JEPA) on classification of ImageNet-1K and iNaturalist-21, and semantic segmentation of ADE20K and Cityscapes.
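The hierarchical objective described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the loss function, the choice of layers, or how the teacher is maintained. The sketch assumes a mean-squared-error loss averaged over a hand-picked set of layers and an exponential-moving-average (EMA) teacher, which is a common choice in self-distillation. The names `ema_update`, `mse`, and `multi_layer_distillation_loss` are hypothetical.

```python
# Illustrative sketch only; loss, layer choice, and EMA teacher are assumptions.

def ema_update(teacher, student, momentum=0.99):
    """Move each teacher parameter a small step toward the student parameter."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def multi_layer_distillation_loss(predictions, teacher_hiddens, layer_ids):
    """Average the per-layer MSE between student predictions and the
    teacher's hidden representations at the selected layers."""
    losses = [mse(predictions[i], teacher_hiddens[i]) for i in layer_ids]
    return sum(losses) / len(losses)


# Toy example: targets come from three hidden layers of the teacher.
teacher_hiddens = {2: [1.0, 2.0], 4: [0.0, 0.0], 6: [3.0, 3.0]}
predictions = {2: [1.0, 2.0], 4: [1.0, 1.0], 6: [3.0, 5.0]}
loss = multi_layer_distillation_loss(predictions, teacher_hiddens, [2, 4, 6])
print(loss)
```

Drawing targets from several depths at once is what distinguishes this setup from final-layer self-distillation, where only the last entry in `layer_ids` would be used.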

