Training a Student Expert via Semi-Supervised Foundation Model Distillation
This work addresses the problem of deploying large vision foundation models in resource-constrained settings by compressing them into compact experts using limited labeled data and abundant unlabeled data, outperforming state-of-the-art semi-supervised distillation methods on Cityscapes and ADE20K.
The authors propose a semi-supervised knowledge distillation framework that compresses vision foundation models into compact experts for instance segmentation. The resulting student is roughly 11× smaller than its teacher, outperforms the zero-shot teacher by 11.9 AP on Cityscapes and 8.6 AP on ADE20K, and surpasses the domain-adapted teacher by 3.4 and 1.5 AP, respectively.
Foundation models deliver strong perception but are often too computationally heavy to deploy, and adapting them typically requires costly annotations. We introduce a semi-supervised knowledge distillation (SSKD) framework that compresses pre-trained vision foundation models (VFMs) into compact experts using limited labeled and abundant unlabeled data, and instantiate it for instance segmentation, where per-pixel labels are particularly expensive. The framework unfolds in three stages: (1) domain adaptation of the VFM(s) via self-training with contrastive calibration, (2) knowledge transfer through a unified multi-objective loss, and (3) student refinement to mitigate residual pseudo-label bias. Central to our approach is an instance-aware pixel-wise contrastive loss that fuses mask and class scores to extract informative negatives and enforce clear inter-instance margins. By maintaining this contrastive signal across both adaptation and distillation, we align teacher and student embeddings and more effectively leverage unlabeled images. On Cityscapes and ADE20K, our $\approx 11\times$ smaller student improves over its zero-shot VFM teacher(s) by 11.9 and 8.6 AP, surpasses the adapted teacher(s) by 3.4 and 1.5 AP, and outperforms state-of-the-art SSKD methods on these benchmarks.
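To make the idea of an instance-aware pixel-wise contrastive loss more concrete, the following is a minimal sketch and not the authors' formulation, which the abstract does not specify. It assumes that mask and class scores are fused by a simple product, that pixels below a confidence threshold are discarded, and that the remaining pixels are compared with an InfoNCE-style objective in which pixels of the same instance are positives and all other pixels are negatives. The function name `instance_contrastive_loss` and the parameters `tau` and `min_conf` are hypothetical.

```python
import math


def _unit(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def instance_contrastive_loss(embeddings, instance_ids, mask_scores,
                              class_scores, tau=0.1, min_conf=0.5):
    """InfoNCE-style loss over pixel embeddings grouped by instance.

    Assumptions (not taken from the paper): the fused confidence is the
    product mask_score * class_score, and pixels whose fused confidence
    is below `min_conf` are dropped before computing the loss.
    """
    # Keep only pixels whose fused mask/class confidence is high enough.
    kept = [
        (_unit(emb), inst)
        for emb, inst, m, c in zip(embeddings, instance_ids,
                                   mask_scores, class_scores)
        if m * c >= min_conf
    ]

    losses = []
    for a, (emb_a, inst_a) in enumerate(kept):
        logits, is_pos = [], []
        for b, (emb_b, inst_b) in enumerate(kept):
            if a == b:
                continue  # a pixel is never its own positive or negative
            cosine = sum(x * y for x, y in zip(emb_a, emb_b))
            logits.append(cosine / tau)
            is_pos.append(inst_a == inst_b)
        if not any(is_pos):
            continue  # anchors without a positive contribute nothing

        # Numerically stable log-sum-exp over all other kept pixels.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        pos_log_probs = [l - log_denom for l, p in zip(logits, is_pos) if p]
        losses.append(-sum(pos_log_probs) / len(pos_log_probs))

    return sum(losses) / len(losses) if losses else 0.0


# Toy usage: four pixels, two embedding directions.
pixels = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
confident = [1.0, 1.0, 1.0, 1.0]
separated = instance_contrastive_loss(pixels, [0, 0, 1, 1], confident, confident)
entangled = instance_contrastive_loss(pixels, [0, 1, 0, 1], confident, confident)
```

In this toy setup, `separated` is small because each instance occupies its own embedding direction, while `entangled` is large because pixels from different instances share a direction. That gap is the inter-instance margin such a loss is meant to encourage. In a semi-supervised setting the instance ids and scores would come from pseudo-labels, which is why filtering by a fused confidence is a plausible, though assumed, design choice here.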