CLIP-driven Zero-shot Learning with Ambiguous Labels
This work is relevant to researchers and practitioners applying zero-shot learning in real-world scenarios, where data often contains noisy or ambiguous labels; it aims to make models more robust to such labels.
This paper addresses the problem of zero-shot learning (ZSL) when training data contains ambiguous labels. The proposed CLIP-PZSL framework leverages CLIP to extract features and a semantic mining block to refine label embeddings, along with a partial zero-shot loss to handle label ambiguity and progressively identify ground-truth labels.
Zero-shot learning (ZSL) aims to recognize unseen classes by leveraging semantic information from seen classes, but most existing methods assume that training instances carry accurate class labels. In real-world scenarios, however, noisy and ambiguous labels can significantly reduce ZSL performance. To address this, we propose a new CLIP-driven partial label zero-shot learning (CLIP-PZSL) framework to handle label ambiguity. First, we use CLIP to extract instance and label features. Then, a semantic mining block fuses these features to extract discriminative label embeddings. We also introduce a partial zero-shot loss, which assigns weights to candidate labels based on their relevance to the instance and aligns instance and label embeddings to minimize semantic mismatch. As training proceeds, the ground-truth labels are progressively identified, and the refined labels and label embeddings in turn improve the semantic alignment of instance and label features. Comprehensive experiments on several datasets demonstrate the advantage of CLIP-PZSL.
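
The weighting idea behind a partial-label loss can be illustrated with a minimal NumPy sketch. This is not the CLIP-PZSL loss itself, whose exact form the abstract does not give. It assumes that instance and label embeddings are already available (random arrays stand in for CLIP features here), that relevance is measured by cosine similarity, and that candidate weights come from a softmax restricted to each instance's candidate set. The function names, the temperature value, and the toy data are all hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def candidate_weights(sim, candidate_mask, temperature=0.1):
    # Assumed weighting rule: softmax over similarities, restricted to the
    # candidate labels of each instance. Non-candidates receive weight 0.
    logits = np.where(candidate_mask, sim / temperature, -np.inf)
    logits = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

def partial_label_loss(inst_emb, label_emb, candidate_mask, temperature=0.1):
    # Cosine similarity between every instance and every seen-class label.
    sim = l2_normalize(inst_emb) @ l2_normalize(label_emb).T
    weights = candidate_weights(sim, candidate_mask, temperature)
    # Log-softmax over all seen classes (numerically stabilised).
    logits = sim / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Weighted cross-entropy: candidates judged more relevant count more.
    loss = -(weights * log_prob).sum(axis=1).mean()
    return loss, weights

# Toy data (hypothetical): 4 instances, 5 seen classes, 8-dim embeddings.
rng = np.random.default_rng(0)
inst_emb = rng.normal(size=(4, 8))
label_emb = rng.normal(size=(5, 8))
# Each row marks the ambiguous candidate label set of one instance.
mask = np.array([
    [1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 1, 0, 0],
]).astype(bool)

loss, weights = partial_label_loss(inst_emb, label_emb, mask)
print("loss:", loss)
print("candidate weights:\n", weights.round(3))
```

In an actual training loop the weights would be recomputed as the embeddings improve, so that mass gradually concentrates on one candidate per instance; that is the sense in which ground-truth labels are "progressively identified" in this sketch.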