CVCL
Jul 28, 2023

Cross-Modal Concept Learning and Inference for Vision-Language Models

CMU
arXiv:2307.15460v1
21 citations
h-index: 15
Originality: Incremental advance
AI Analysis

This paper addresses a bottleneck in fine-tuning vision-language models for tasks such as few-shot learning and domain generalization, offering incremental improvements over existing methods.

Whole-image matching in vision-language models is ineffective for fine-tuning because images of the same class contain varying semantic objects and concepts. The paper addresses this with a cross-modal concept learning and inference method that automatically learns visual concepts from images, guided by text concepts, and uses them for image classification. The reported results show improvements of up to 8.0% on few-shot learning and up to 1.3% on domain generalization over state-of-the-art methods.

Large-scale pre-trained Vision-Language Models (VLMs), such as CLIP, establish the correlation between texts and images, achieving remarkable success on various downstream tasks with fine-tuning. In existing fine-tuning methods, the class-specific text description is matched against the whole image. We recognize that this whole-image matching is not effective since images from the same class often contain a set of different semantic objects, and an object further consists of a set of semantic parts or concepts. Individual semantic parts or concepts may appear in image samples from different classes. To address this issue, in this paper, we develop a new method called cross-modal concept learning and inference (CCLI). Using the powerful text-image correlation capability of CLIP, our method automatically learns a large set of distinctive visual concepts from images using a set of semantic text concepts. Based on these visual concepts, we construct a discriminative representation of images and learn a concept inference network to perform downstream image classification tasks, such as few-shot learning and domain generalization. Extensive experimental results demonstrate that our CCLI method improves upon the current state-of-the-art methods by large margins, for example, by up to 8.0% on few-shot learning and by up to 1.3% on domain generalization.
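
The pipeline described in the abstract (text concepts select visual concepts, visual concepts give a per-image representation, and a classifier runs on that representation) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The random toy vectors stand in for CLIP embeddings, the top-k averaging rule for forming a visual concept is an assumption, and a nearest-class-mean classifier stands in for the paper's concept inference network. All function names are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def learn_visual_concepts(image_feats, text_concept_feats, top_k):
    """Assumed rule: a visual concept is the mean of the top_k image
    features most similar to the corresponding text concept."""
    sims = l2_normalize(text_concept_feats) @ l2_normalize(image_feats).T
    top_idx = np.argsort(-sims, axis=1)[:, :top_k]           # (n_concepts, top_k)
    return l2_normalize(image_feats[top_idx].mean(axis=1))   # (n_concepts, dim)


def concept_representation(image_feats, visual_concepts):
    """Describe each image by its cosine similarity to every visual concept."""
    return l2_normalize(image_feats) @ visual_concepts.T     # (n_images, n_concepts)


def fit_class_prototypes(reprs, labels, n_classes):
    """Stand-in for the concept inference network: one mean vector per class."""
    return np.stack([reprs[labels == c].mean(axis=0) for c in range(n_classes)])


def predict(reprs, prototypes):
    """Assign each image to the class whose prototype is most similar."""
    return (l2_normalize(reprs) @ l2_normalize(prototypes).T).argmax(axis=1)


# Toy data (assumption): 4 text concepts in an 8-d space, 2 classes,
# each class built from two of the concepts plus small Gaussian noise.
rng = np.random.default_rng(0)
dim = 8
text_concepts = np.eye(dim)[:4]
class_dirs = np.stack([text_concepts[:2].sum(axis=0),
                       text_concepts[2:].sum(axis=0)])


def sample(n_per_class):
    """Draw noisy stand-in image features and their class labels."""
    labels = np.repeat(np.arange(2), n_per_class)
    feats = class_dirs[labels] + 0.05 * rng.standard_normal((labels.size, dim))
    return feats, labels


train_feats, train_labels = sample(6)   # a few labelled shots per class
test_feats, test_labels = sample(4)

visual_concepts = learn_visual_concepts(train_feats, text_concepts, top_k=3)
train_repr = concept_representation(train_feats, visual_concepts)
prototypes = fit_class_prototypes(train_repr, train_labels, n_classes=2)
preds = predict(concept_representation(test_feats, visual_concepts), prototypes)
print("toy accuracy:", (preds == test_labels).mean())
```

In a real setting the feature arrays would come from a CLIP image and text encoder, and the prototype classifier would be replaced by a trained network. The sketch only shows how a concept-level representation differs from matching one class prompt against the whole image.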

