LG | CV | Sep 15, 2024

Finetuning CLIP to Reason about Pairwise Differences

CMU
arXiv:2409.09721v2 | 9 citations | h-index: 10
Originality: Incremental advance
AI Analysis

This work addresses a limitation of vision-language models in applications such as retrieval and classification, though the advance is incremental because it builds on existing CLIP methods.

The paper tackles the problem that CLIP embeddings lack structured reasoning about differences between images. It finetunes CLIP on synthetic data so that text descriptions of differences between images align with the corresponding differences in image embedding space. The reported results are improved ranking of images by attribute, better zeroshot classification on many tasks, and a new inference mechanism, comparative prompting, that yields even larger classification gains.

Vision-language models (VLMs) such as CLIP are trained via contrastive learning between text and image pairs, resulting in aligned image and text embeddings that are useful for many downstream tasks. A notable drawback of CLIP, however, is that the resulting embedding space seems to lack some of the structure of its purely text-based alternatives. For instance, while text embeddings have long been noted to satisfy analogies in embedding space using vector arithmetic, CLIP has no such property. In this paper, we propose an approach to natively train CLIP in a contrastive manner to reason about differences in embedding space. We finetune CLIP so that text descriptions of differences between images correspond to their difference in image embedding space, using synthetically generated data with large language models on image-caption paired datasets. We first demonstrate that our approach yields significantly improved capabilities in ranking images by a certain attribute (e.g., elephants are larger than cats), which is useful in retrieval or constructing attribute-based classifiers, and improved zeroshot classification performance on many downstream image classification tasks. In addition, our approach enables a new mechanism for inference that we refer to as comparative prompting, where we leverage prior knowledge of text descriptions of differences between classes of interest, achieving even larger performance gains in classification. Finally, we illustrate that the resulting embeddings obey a larger degree of geometric properties in embedding space, such as in text-to-image generation.
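The core idea in the abstract, that a text description of the difference between two images should match the difference of their image embeddings, can be illustrated with a small contrastive objective. The following is a minimal sketch in NumPy, not the authors' implementation: the function name `difference_contrastive_loss`, the symmetric InfoNCE form, the normalization of the difference vector, and the temperature of 0.07 are all assumptions made for illustration, and the embeddings are random stand-ins rather than CLIP outputs.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row of x to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(z, axis):
    """Numerically stable log-softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))


def difference_contrastive_loss(img_a, img_b, diff_text, temperature=0.07):
    """Hypothetical loss: align image-embedding differences with text embeddings.

    img_a, img_b: (n, d) image embeddings for n image pairs.
    diff_text:    (n, d) text embeddings of "how image a differs from image b".
    The loss form and the temperature are assumptions, not taken from the paper.
    """
    # Difference of the normalized image embeddings, renormalized to unit length.
    delta = l2_normalize(l2_normalize(img_a) - l2_normalize(img_b))
    text = l2_normalize(diff_text)

    # Cosine similarity between every difference vector and every description.
    logits = delta @ text.T / temperature
    idx = np.arange(logits.shape[0])

    # Symmetric InfoNCE: matching (difference, description) pairs lie on the diagonal.
    loss_d2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2d = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_d2t + loss_t2d)


# Usage with random stand-in embeddings (real use would take CLIP encoder outputs).
rng = np.random.default_rng(0)
img_a = rng.normal(size=(8, 16))
img_b = rng.normal(size=(8, 16))
aligned_text = l2_normalize(img_a) - l2_normalize(img_b)   # perfectly aligned descriptions
shuffled_text = np.roll(aligned_text, shift=1, axis=0)     # descriptions paired with the wrong images

loss_aligned = difference_contrastive_loss(img_a, img_b, aligned_text)
loss_shuffled = difference_contrastive_loss(img_a, img_b, shuffled_text)
```

In this toy setup the loss is low when each description points in the same direction as its pair's embedding difference and high when descriptions are paired with the wrong images, which is the behaviour a finetuning objective of this kind would push the encoders toward.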

Code Implementations: 1 repo