Bridging Weakly-Supervised Learning and VLM Distillation: Noisy Partial Label Learning for Efficient Downstream Adaptation
This work addresses the challenge of adapting VLMs for efficient downstream training without manual annotation, which is incremental as it builds on existing noisy label learning methods by handling instance-dependent noise.
The paper tackles the problem of learning from noisy partial labels generated by pre-trained vision-language models (VLMs) for downstream tasks, proposing a collaborative consistency regularization (Co-Reg) method that achieves improved accuracy by efficiently mining latent information from these labels.
In the context of noisy partial label learning (NPLL), each training sample is associated with a set of candidate labels annotated by multiple noisy annotators. With the emergence of high-performance pre-trained vision-language models (VLMs) such as CLIP, LLaVA and GPT-4V, the direction of using these models to replace time-consuming manual annotation workflows and achieve ``manual-annotation-free" training for downstream tasks has become a highly promising research avenue. This paper focuses on learning from noisy partial labels annotated by pre-trained VLMs and proposes an innovative collaborative consistency regularization (Co-Reg) method. Unlike the symmetric noise primarily addressed in traditional noisy label learning, the noise generated by pre-trained models is instance-dependent, embodying the underlying patterns of the pre-trained models themselves, which significantly increases the learning difficulty for the model. To address this, we simultaneously train two neural networks that implement collaborative purification of training labels through a ``Co-Pseudo-Labeling" mechanism, while enforcing consistency regularization constraints in both the label space and feature representation space. Specifically, we construct multiple anti-overfitting mechanisms that efficiently mine latent information from noisy partially labeled samples including alternating optimization of contrastive feature representations and pseudo-labels, as well as maintaining prototypical class vectors in the shared feature space.