Enhancing Semi-supervised Learning with Zero-shot Pseudolabels
This work addresses the challenge of deploying machine learning in resource-constrained settings by enabling efficient training with foundation models, though the contribution is incremental because it builds on existing semi-supervised learning (SSL) and zero-shot techniques.
The paper tackles the high cost of labeling in machine learning by proposing ZeroMatch, a semi-supervised learning framework that integrates knowledge distillation with consistency-based learning to leverage labeled data, unlabeled data, and pseudo-labels from foundation models. It reports consistent performance improvements over standard methods across six benchmarks.
The high cost of data labeling presents a major barrier to deploying machine learning systems at scale. Semi-supervised learning (SSL) mitigates this challenge by utilizing unlabeled data alongside limited labeled examples, while the emergence of foundation models (FMs) offers powerful zero-shot capabilities that can further reduce labeling cost. However, directly fine-tuning large FMs is often impractical in resource-constrained settings, and naïvely using their pseudo-labels for unlabeled data can degrade performance due to their unreliability or domain mismatch with the target task. In this work, we introduce ZeroMatch, a novel SSL framework that integrates knowledge distillation with consistency-based learning to jointly leverage labeled data, unlabeled data, and pseudo-labels from FMs. ZeroMatch enables training compact student models using only FM inference, making it suitable for low-resource environments such as personal devices with limited compute. Experiments on six vision and language classification benchmarks show that ZeroMatch consistently outperforms standard SSL and zero-shot augmented methods, demonstrating its effectiveness and robustness across a range of foundation model qualities.
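To make the idea of combining knowledge distillation with consistency-based learning concrete, the following is a minimal sketch of one way the three signals could be combined into a single training objective. It is not ZeroMatch's actual objective, which the abstract does not specify. The function names, the FixMatch-style confidence threshold, the use of hard FM pseudo-labels, and the weights `lambda_u` and `lambda_kd` are all assumptions made for illustration.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the class axis."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def cross_entropy(logits, targets):
    """Per-example cross-entropy against integer class targets."""
    logp = log_softmax(logits)
    return -logp[np.arange(len(targets)), targets]


def combined_ssl_loss(logits_labeled, labels, logits_weak, logits_strong,
                      fm_pseudo_labels, threshold=0.95,
                      lambda_u=1.0, lambda_kd=1.0):
    """Illustrative objective (an assumption, not the paper's formulation).

    logits_labeled:   student logits on labeled examples
    logits_weak:      student logits on weakly augmented unlabeled examples
    logits_strong:    student logits on strongly augmented unlabeled examples
    fm_pseudo_labels: hard zero-shot labels produced by FM inference
    """
    # 1) Supervised term on the small labeled set.
    supervised = cross_entropy(logits_labeled, labels).mean()

    # 2) Consistency term (FixMatch-style, assumed here): the student's own
    #    confident predictions on weak views supervise the strong views.
    probs_weak = np.exp(log_softmax(logits_weak))
    confidence = probs_weak.max(axis=1)
    self_labels = probs_weak.argmax(axis=1)
    mask = (confidence >= threshold).astype(float)
    consistency = (cross_entropy(logits_strong, self_labels) * mask).mean()

    # 3) Distillation term: fit the FM's zero-shot pseudo-labels.
    distillation = cross_entropy(logits_weak, fm_pseudo_labels).mean()

    total = supervised + lambda_u * consistency + lambda_kd * distillation
    parts = {"supervised": supervised,
             "consistency": consistency,
             "distillation": distillation}
    return total, parts
```

In this sketch the FM is used only to produce `fm_pseudo_labels` ahead of time, which mirrors the abstract's point that only FM inference is required. The separate weight `lambda_kd` is one simple way to limit the influence of unreliable pseudo-labels, but it is a design choice of this sketch rather than something stated in the source.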