GroupCoOp: Group-robust Fine-tuning via Group Prompt Learning
GroupCoOp addresses group robustness in vision-language models for tasks such as image classification; the contribution is incremental in that it builds on existing prompt learning methods.
The paper tackles spurious correlations that arise in fine-tuned vision-language models from subgroup imbalance in the fine-tuning data. It proposes GroupCoOp to improve group robustness and reports the best results on five benchmarks across five CLIP architectures while training only 0.016% of the network's parameters.
Parameter-efficient fine-tuning (PEFT) of vision-language models (VLMs) excels in various vision tasks thanks to the rich knowledge and generalization ability of VLMs. However, recent studies revealed that such fine-tuned VLMs are vulnerable to spurious correlations stemming from subgroup imbalance in the fine-tuning datasets. To resolve this issue, we propose Group Context Optimization (GroupCoOp), a simple and effective debiased fine-tuning algorithm that enhances the group robustness of fine-tuned VLMs. Its key idea is to employ group-specific text prompts as group representatives that serve as multiple classifiers for their target class. The rich semantic knowledge of the VLM's text encoder enables the discovery of effective group prompts even for groups with a small number of training samples. Leveraging the group prompts for each class addresses the issues caused by the group-imbalanced training set, such as the neglect of minority groups and the scattered distribution of each class in the embedding space. GroupCoOp achieved the best results on five benchmarks across five CLIP architectures and occasionally outperformed prior methods that fine-tune the entire network, despite training only 0.016% of the network's parameters.
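To make the idea of group prompts acting as multiple classifiers per class more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each class has the same number of group prompts, that the prompts have already been encoded into embeddings by the text encoder, and that an image is assigned to the class owning the most similar group prompt (a max-over-groups rule chosen here for illustration; the abstract does not state the aggregation rule). The function name `classify_with_group_prompts` and the array shapes are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def classify_with_group_prompts(image_emb, group_prompt_emb):
    """Assign each image to the class of its most similar group prompt.

    image_emb:        (N, D) image embeddings.
    group_prompt_emb: (C, G, D) text embeddings, one per (class, group) pair.
    Returns (pred_class, pred_group, sims) where sims has shape (N, C, G).

    Assumption: max-over-groups aggregation, used here only for illustration.
    """
    num_classes, num_groups, dim = group_prompt_emb.shape
    img = l2_normalize(image_emb)
    txt = l2_normalize(group_prompt_emb).reshape(num_classes * num_groups, dim)
    sims = img @ txt.T                      # (N, C*G) cosine similarities
    best = sims.argmax(axis=1)              # index of the closest group prompt
    pred_class = best // num_groups         # class that owns that prompt
    pred_group = best % num_groups          # group within that class
    return pred_class, pred_group, sims.reshape(-1, num_classes, num_groups)


# Toy example: 2 classes x 2 groups in a 4-dimensional embedding space.
prompts = np.eye(4).reshape(2, 2, 4)
images = np.array([
    [0.1, 0.9, 0.0, 0.0],   # closest to class 0, group 1
    [0.0, 0.0, 1.0, 0.2],   # closest to class 1, group 0
])
pred_class, pred_group, sims = classify_with_group_prompts(images, prompts)
print(pred_class, pred_group)
```

Because every group has its own prompt, a minority group far from the class majority in embedding space can still be matched by its own representative rather than being forced toward a single class-level prompt.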