CV | LG | Feb 5, 2025

CLIP Behaves like a Bag-of-Words Model Cross-modally but not Uni-modally

arXiv:2502.03566v2 | 22 citations | h-index: 5 | Has Code
Originality: Incremental advance
AI Analysis

This paper addresses a critical limitation of CLIP in compositional reasoning, which matters for vision-language applications, though it is an incremental improvement over existing methods.

The paper tackles CLIP's failure to bind attributes to objects in compositional tasks. It identifies cross-modal alignment via cosine similarity as the source of the problem and proposes LABCLIP, a linear transformation applied to text embeddings that significantly improves binding accuracy.

CLIP (Contrastive Language-Image Pretraining) has become a popular choice for various downstream tasks. However, recent studies have questioned its ability to represent compositional concepts effectively. These works suggest that CLIP often acts like a bag-of-words (BoW) model, interpreting images and text as sets of individual concepts without grasping the structural relationships. In particular, CLIP struggles to correctly bind attributes to their corresponding objects when multiple objects are present in an image or text. In this work, we investigate why CLIP exhibits this BoW-like behavior. We find that the correct attribute-object binding information is already present in individual text and image modalities. Instead, the issue lies in the cross-modal alignment, which relies on cosine similarity. To address this, we propose Linear Attribute Binding CLIP or LABCLIP. It applies a linear transformation to text embeddings before computing cosine similarity. This approach significantly improves CLIP's ability to bind attributes to correct objects, thereby enhancing its compositional understanding. The code is available at https://github.com/kdariina/CLIP-not-BoW-unimodally.
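The scoring step described in the abstract (a linear transformation applied to text embeddings before cosine similarity) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `cosine_similarity` and `lab_scores`, the embedding dimension, and the random embeddings are all hypothetical, and the matrix `W` here is a random placeholder, since the abstract does not say how the transformation is learned. For the actual method, see the linked repository.

```python
import numpy as np

def cosine_similarity(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def lab_scores(image_emb, text_emb, W):
    """Score images against captions after linearly transforming the text embeddings.

    Hypothetical helper: W is a (dim, dim) matrix applied to each text embedding.
    """
    return cosine_similarity(image_emb, text_emb @ W.T)

# Stand-in embeddings (assumption: real ones would come from frozen CLIP encoders).
rng = np.random.default_rng(0)
dim = 8
image_emb = rng.normal(size=(2, dim))  # 2 images
text_emb = rng.normal(size=(3, dim))   # 3 captions

# With W set to the identity, the score reduces to plain CLIP-style cosine similarity.
baseline = cosine_similarity(image_emb, text_emb)
assert np.allclose(lab_scores(image_emb, text_emb, np.eye(dim)), baseline)

# Placeholder W; in practice it would be learned, which this sketch does not cover.
W = rng.normal(size=(dim, dim))
scores = lab_scores(image_emb, text_emb, W)
print(scores.shape)  # (2, 3): one score per image-caption pair
```

Because only the text side is transformed, a setup like this leaves both encoders untouched and adds a single matrix on top of the existing embeddings.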
