CV LG Oct 5, 2023

Investigating the Limitation of CLIP Models: The Worst-Performing Categories

Jie-Jing Shao, Jiang-Xin Shi, Xiao-Wen Yang, Lan-Zhe Guo, Yu-Feng Li

arXiv:2310.03324v1 11.020 citations h-index: 18

Originality: Incremental advance

AI Analysis

The paper addresses a risk-sensitive issue for users of CLIP models in applications where certain categories are critical, though the advance is incremental because it builds on existing CLIP frameworks.

The paper tackles CLIP's poor performance on its worst-performing categories: on ImageNet, 10 categories have 0% class-wise accuracy even though overall accuracy is 64.1%. It proposes a method that raises accuracy on the worst-10 ImageNet categories to 5.2% without manual prompt engineering, laborious optimization, or labeled validation data.

Contrastive Language-Image Pre-training (CLIP) provides a foundation model by integrating natural language into visual concepts, enabling zero-shot recognition on downstream tasks. It is usually expected that satisfactory overall accuracy can be achieved across numerous domains through well-designed textual prompts. However, we found that CLIP's performance on the worst categories is significantly inferior to its overall performance. For example, on ImageNet, a total of 10 categories have class-wise accuracy as low as 0%, even though the overall accuracy reaches 64.1%. This phenomenon reveals the potential risks of using CLIP models, particularly in risk-sensitive applications where specific categories hold significant importance. To address this issue, we investigate the alignment between the two modalities in the CLIP model and propose the Class-wise Matching Margin (CMM) to measure the inference confusion. CMM can effectively identify the worst-performing categories and estimate the potential performance of the candidate prompts. We further query large language models to enrich descriptions of worst-performing categories and build a weighted ensemble to highlight the efficient prompts. Experimental results clearly verify the effectiveness of our proposal: the accuracy on the worst-10 categories on ImageNet is boosted to 5.2%, without manual prompt engineering, laborious optimization, or access to labeled validation data.
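
The gap between overall accuracy and worst-category accuracy described above can be illustrated with a short sketch. This is not the paper's code and does not implement CMM; it only shows, on toy labels, how class-wise accuracy and a worst-k average might be computed. The function names `class_wise_accuracy` and `worst_k_accuracy` and the toy data are assumptions introduced for illustration.

```python
import numpy as np

def class_wise_accuracy(y_true, y_pred, num_classes):
    """Per-class accuracy; classes with no samples are reported as NaN."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Count correct predictions and total samples for each class label.
    correct = np.bincount(y_true[y_true == y_pred], minlength=num_classes)
    total = np.bincount(y_true, minlength=num_classes)
    acc = np.full(num_classes, np.nan)
    seen = total > 0
    acc[seen] = correct[seen] / total[seen]
    return acc

def worst_k_accuracy(per_class_acc, k):
    """Mean accuracy over the k lowest-scoring classes (NaN entries ignored)."""
    valid = per_class_acc[~np.isnan(per_class_acc)]
    return float(np.sort(valid)[:k].mean())

# Toy labels (hypothetical): 4 classes, 2 samples each.
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 0, 1, 1, 2, 0, 0, 1]

acc = class_wise_accuracy(y_true, y_pred, num_classes=4)
overall = float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
worst_1 = worst_k_accuracy(acc, k=1)

print("per-class:", acc)
print("overall:", overall)
print("worst-1:", worst_1)
```

In this toy case the overall accuracy looks moderate while one class is never predicted correctly, which mirrors the kind of discrepancy the abstract reports for ImageNet.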
