Bi-CoG: Bi-Consistency-Guided Self-Training for Vision-Language Models
This work addresses label-scarce scenarios in vision-language tasks, offering an incremental improvement over existing semi-supervised fine-tuning methods.
The paper tackles model bias and hyperparameter sensitivity in semi-supervised fine-tuning of vision-language models. It proposes Bi-CoG, a method that combines bi-consistency guidance with dynamic pseudo-label assignment and reports consistent, significant performance improvements across 14 datasets.
Exploiting unlabeled data through semi-supervised learning (SSL) and leveraging pre-trained models via fine-tuning are two prevailing paradigms for addressing label-scarce scenarios. Recently, growing attention has been given to combining fine-tuning of pre-trained vision-language models (VLMs) with SSL, forming the emerging paradigm of semi-supervised fine-tuning. However, existing methods often suffer from model bias and hyperparameter sensitivity because they rely on prediction consistency or pre-defined confidence thresholds. To address these limitations, we propose a simple yet effective plug-and-play methodology named Bi-Consistency-Guided Self-Training (Bi-CoG), which assigns high-quality, low-bias pseudo-labels by simultaneously exploiting inter-model and intra-model consistency, together with an error-aware dynamic pseudo-label assignment strategy. Both theoretical analysis and extensive experiments over 14 datasets demonstrate the effectiveness of Bi-CoG, which consistently and significantly improves the performance of existing methods.
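The abstract does not specify how the two consistency signals or the dynamic assignment are computed, so the following is only an illustrative sketch and not the Bi-CoG algorithm itself. It assumes that intra-model consistency means one model agreeing with itself on two augmented views, that inter-model consistency means two models agreeing on the same view, and it substitutes a simple batch-quantile cutoff for the error-aware dynamic rule. The function name `select_pseudo_labels` and the parameter `keep_fraction` are hypothetical.

```python
import numpy as np


def select_pseudo_labels(probs_a_weak, probs_a_strong, probs_b_weak,
                         keep_fraction=0.5):
    """Hypothetical helper: keep unlabeled samples whose predictions agree.

    probs_a_weak, probs_a_strong: model A's class probabilities on two
        augmented views of the same images (assumed intra-model check).
    probs_b_weak: a second model's probabilities on the first view
        (assumed inter-model check).
    keep_fraction: assumed stand-in for a dynamic rule; the confidence
        cutoff is derived from the batch instead of a fixed threshold.
    """
    pred_a = probs_a_weak.argmax(axis=1)

    # Intra-model consistency: same model, two views, same predicted class.
    intra = pred_a == probs_a_strong.argmax(axis=1)
    # Inter-model consistency: two models, same view, same predicted class.
    inter = pred_a == probs_b_weak.argmax(axis=1)
    agree = intra & inter

    # Batch-dependent cutoff over the agreeing samples only.
    conf = probs_a_weak.max(axis=1)
    mask = np.zeros(len(conf), dtype=bool)
    if agree.any():
        cutoff = np.quantile(conf[agree], 1.0 - keep_fraction)
        mask = agree & (conf >= cutoff)

    # pred_a holds candidate labels; mask marks the ones that are kept.
    return pred_a, mask
```

In this sketch, only samples that pass both agreement checks compete for the cutoff, so a sample with high confidence but inconsistent predictions is never assigned a pseudo-label.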