Simple yet Effective Semi-supervised Knowledge Distillation from Vision-Language Models via Dual-Head Optimization
This work addresses data scarcity challenges in semi-supervised learning by improving knowledge distillation from VLMs, offering a practical solution with minimal computational overhead for researchers and practitioners in computer vision.
The paper tackles the problem of transferring the generalization capabilities of vision-language models (VLMs) into task-specific models via knowledge distillation. It proposes Dual-Head Optimization (DHO) to resolve gradient conflicts between the supervised and distillation losses, outperforming KD baselines across 15 datasets and achieving new state-of-the-art performance on ImageNet semi-supervised learning and on out-of-distribution generalization across ImageNet variants.
Semi-supervised learning (SSL) has emerged as a practical solution for addressing data scarcity challenges by leveraging unlabeled data. Recently, vision-language models (VLMs), pre-trained on massive image-text pairs, have demonstrated remarkable zero-/few-shot performance that often surpasses SSL approaches due to their exceptional generalization capabilities. This gap motivates our question: how can we effectively transfer the powerful generalization capabilities of VLMs into task-specific models? Knowledge distillation (KD) offers a natural framework for transferring VLM capabilities, but we identify that it suffers from gradient conflicts between the supervised and distillation losses. To address this challenge, we propose Dual-Head Optimization (DHO), which introduces two prediction heads, one for each of these signals. We observe that DHO resolves the gradient conflicts, enabling improved feature learning compared to single-head KD baselines, with the practical benefits of minimal computational overhead and test-time hyperparameter tuning without retraining. Extensive experiments across 15 datasets show that DHO consistently outperforms KD baselines, and that smaller student models often outperform their teachers. DHO also achieves new state-of-the-art performance on both in-distribution ImageNet semi-supervised learning and out-of-distribution generalization across ImageNet variants. We publicly release our code and model checkpoints to facilitate future research at https://github.com/erjui/DHO.
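To make the dual-head idea concrete, the following is a minimal sketch in plain Python, not the released implementation. It assumes details the abstract does not state: both heads are linear layers on shared features, the supervised head is trained with cross-entropy on labeled samples only, the distillation head is trained with a KL term against teacher probabilities on all samples, and the two heads are combined at inference by a linear interpolation of their softmax outputs. The names `dual_head_loss`, `dual_head_predict`, `kd_weight`, and `alpha` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(features, weights):
    """Apply a bias-free linear head; `weights` holds one row per class."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def cross_entropy(probs, label):
    """Supervised loss for one labeled sample."""
    return -math.log(probs[label])


def kl_divergence(teacher_probs, student_probs):
    """KL(teacher || student), used here as the distillation loss."""
    return sum(
        t * math.log(t / s)
        for t, s in zip(teacher_probs, student_probs)
        if t > 0
    )


def dual_head_loss(features, ce_head, kd_head, teacher_probs,
                   label=None, kd_weight=1.0):
    """Each loss reaches its own head; only the features are shared.

    Assumption: unlabeled samples (label=None) contribute the
    distillation term only.
    """
    kd_probs = softmax(linear(features, kd_head))
    loss = kd_weight * kl_divergence(teacher_probs, kd_probs)
    if label is not None:
        ce_probs = softmax(linear(features, ce_head))
        loss += cross_entropy(ce_probs, label)
    return loss


def dual_head_predict(features, ce_head, kd_head, alpha=0.5):
    """Assumed inference rule: interpolate the two heads' probabilities.

    `alpha` can be changed after training, so no retraining is needed.
    """
    ce_probs = softmax(linear(features, ce_head))
    kd_probs = softmax(linear(features, kd_head))
    return [alpha * p + (1.0 - alpha) * q for p, q in zip(ce_probs, kd_probs)]
```

Under these assumptions, the two losses never backpropagate through the same classifier weights, which is one way the separation of signals could be realized. Because `alpha` enters only in `dual_head_predict`, it can be swept on a validation set after training.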