CV | Dec 17, 2024

CRoF: CLIP-based Robust Few-shot Learning on Noisy Labels

Shizhuo Deng, Bowen Han, Jiaqi Chen, Hao Wang, Dongyue Chen, Tong Jia

arXiv:2412.12793v1 | 2.0 | 2 citations | h-index: 7

Originality: Incremental advance

AI Analysis

The paper addresses a critical challenge in few-shot learning for computer vision: noisy labels that degrade performance. The contribution appears incremental, since it builds directly on existing CLIP frameworks.

The paper tackles noisy labels in few-shot learning with CLIP models. It proposes CRoF, a plug-in module that improves robustness through task-oriented prompts and weighted, label-smoothing-style fine-tuning, and reports that it outperforms both vanilla and fine-tuned CLIP baselines across various noise types and ratios.

Noisy labels threaten the robustness of few-shot learning (FSL) because features in a new domain are inexact. CLIP, a large-scale vision-language model, performs well in FSL by relying on image-text embedding similarities, but it is susceptible to misclassification caused by noisy labels. How to enhance the domain generalization of CLIP on noisy data within FSL tasks is therefore a critical challenge. In this paper, we offer a novel approach to mitigating the influence of noisy labels: CLIP-based Robust Few-shot learning (CRoF). CRoF is a general plug-in module for CLIP-based models. To avoid misclassification and confused label embeddings, we design a few-shot task-oriented prompt generator that gives more discriminative descriptions of each category. The proposed prompts yield larger inter-class distances between textual embeddings. Furthermore, rather than fully trusting CLIP's zero-shot classification, we fine-tune CLIP on noisy few-shot data in the new domain using a weighting strategy similar to label smoothing. The weights assigned to multiple potentially correct labels take into account the relationship between CLIP's prior knowledge and the original label information to ensure reliability. Our multiple-label loss function further supports robust training under this paradigm. Comprehensive experiments show that CRoF, as a plug-in, outperforms fine-tuned and vanilla CLIP models across different noise types and noise ratios.
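The abstract describes two mechanisms without giving formulas: prompts that push class text embeddings apart, and fine-tuning on soft targets that mix the given (possibly noisy) label with CLIP's prior. The Python/NumPy sketch below illustrates both ideas under stated assumptions; it is not CRoF's implementation. The names `mean_inter_class_distance`, `soft_targets`, and `soft_cross_entropy`, along with the parameters `alpha` and `top_k`, are hypothetical. The specific weighting rule (keep `alpha` on the given label, spread `1 - alpha` over CLIP's top-k zero-shot classes in proportion to their probabilities) is an assumption standing in for the paper's unstated weighting.

```python
import numpy as np


def mean_inter_class_distance(text_emb: np.ndarray) -> float:
    """Average cosine distance (1 - cosine similarity) over all pairs of distinct classes.

    A larger value means the class text embeddings are more spread apart, which is
    the property the abstract says the task-oriented prompts improve.
    """
    z = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)  # L2-normalize rows
    sim = z @ z.T                                                   # pairwise cosine similarity
    off_diag = ~np.eye(sim.shape[0], dtype=bool)                    # ignore self-similarity
    return float(np.mean(1.0 - sim[off_diag]))


def soft_targets(noisy_labels, clip_probs: np.ndarray, alpha: float = 0.6, top_k: int = 2) -> np.ndarray:
    """HYPOTHETICAL label-smoothing-style weighting (not the paper's formula).

    Each sample keeps weight `alpha` on its given (possibly noisy) label; the
    remaining 1 - alpha is spread over CLIP's top-k zero-shot classes in
    proportion to their CLIP probabilities. Every row sums to 1.
    """
    noisy_labels = np.asarray(noisy_labels)
    n, c = clip_probs.shape
    targets = np.zeros((n, c))
    targets[np.arange(n), noisy_labels] = alpha            # trust in the original label
    top = np.argsort(-clip_probs, axis=1)[:, :top_k]       # CLIP's most likely classes
    for i in range(n):
        p = clip_probs[i, top[i]]
        targets[i, top[i]] += (1.0 - alpha) * p / p.sum()  # trust in CLIP's prior
    return targets


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def soft_cross_entropy(logits: np.ndarray, targets: np.ndarray) -> float:
    """Cross-entropy against soft (multiple-label) targets, averaged over the batch."""
    return float(-(targets * log_softmax(logits)).sum(axis=1).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text_emb = rng.normal(size=(3, 8))  # toy stand-in for 3 class text embeddings
    print("mean inter-class distance:", mean_inter_class_distance(text_emb))

    clip_probs = np.array([[0.7, 0.2, 0.1],   # toy CLIP zero-shot probabilities
                           [0.1, 0.3, 0.6]])
    noisy_labels = np.array([1, 2])           # given labels; sample 0 may be mislabeled
    t = soft_targets(noisy_labels, clip_probs)
    print(np.round(t, 3))
    print("loss:", soft_cross_entropy(np.log(clip_probs), t))
```

Raising `alpha` trusts the provided labels more; lowering it leans on CLIP's zero-shot prior. Whether CRoF uses a fixed constant, top-k candidates, or some other rule cannot be determined from the abstract.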
