CV · Feb 27, 2025

CLIP Under the Microscope: A Fine-Grained Analysis of Multi-Object Representation

Reza Abbasi, Ali Nazari, Aminreza Sefid, Mohammadali Banayeeanzade, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah

arXiv:2502.19842v217.410 citations · h-index: 21 · Has Code · CVPR

Originality: Synthesis-oriented

AI Analysis

The paper addresses CLIP's instability in complex multi-object visual tasks, a concern for AI researchers who rely on CLIP embeddings. Its contribution is incremental: it analyzes existing biases rather than proposing new solutions.

This study analyzed CLIP's limitations in multi-object scenarios, revealing biases where the text encoder prioritizes first-mentioned objects and the image encoder favors larger ones, with performance drops of up to 30% in image-text matching when object size or token order changes.

Contrastive Language-Image Pre-training (CLIP) models excel in zero-shot classification, yet face challenges in complex multi-object scenarios. This study offers a comprehensive analysis of CLIP's limitations in these contexts using a specialized dataset, ComCO, designed to evaluate CLIP's encoders in diverse multi-object scenarios. Our findings reveal significant biases: the text encoder prioritizes first-mentioned objects, and the image encoder favors larger objects. Through retrieval and classification tasks, we quantify these biases across multiple CLIP variants and trace their origins to CLIP's training process, supported by analyses of the LAION dataset and training progression. Our image-text matching experiments show substantial performance drops when object size or token order changes, underscoring CLIP's instability with rephrased but semantically similar captions. Extending this to longer captions and text-to-image models like Stable Diffusion, we demonstrate how prompt order influences object prominence in generated images. For more details and access to our dataset and analysis code, visit our project repository: https://clip-oscope.github.io.
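The text-encoder bias described above can be probed with a simple check: embed a multi-object caption, embed each object name on its own, and see which object the caption embedding sits closest to. The following is a minimal sketch of that idea, not the authors' evaluation code. The function names, the "a and b" caption template, and the `toy_embed` stand-in are all assumptions made for illustration; in practice `embed_text` would wrap a real CLIP text encoder.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def position_similarities(
    embed_text: Callable[[str], Vector], objects: List[str]
) -> List[float]:
    """Similarity of the joint caption to each single-object caption.

    The " and " template is an assumed caption format, chosen for brevity.
    """
    caption_vec = embed_text(" and ".join(objects))
    return [cosine(caption_vec, embed_text(obj)) for obj in objects]


def first_position_win_rate(
    embed_text: Callable[[str], Vector], object_lists: List[List[str]]
) -> float:
    """Fraction of captions whose first object is strictly the most similar."""
    wins = 0
    for objects in object_lists:
        sims = position_similarities(embed_text, objects)
        if len(sims) > 1 and sims[0] > max(sims[1:]):
            wins += 1
    return wins / len(object_lists)


# Hypothetical stand-in for a text encoder, NOT CLIP: a bag-of-words vector
# whose weights decay with token position, so it is order-biased by design.
TOY_VOCAB = ["cat", "dog", "car"]


def toy_embed(text: str) -> List[float]:
    vec = [0.0] * len(TOY_VOCAB)
    for position, word in enumerate(text.split(" and ")):
        vec[TOY_VOCAB.index(word)] += 1.0 / (position + 1)
    return vec
```

A win rate well above 1/k for captions with k objects would be consistent with the first-mention bias the abstract reports. Comparing the rate across shuffled orderings of the same object sets separates positional effects from object-specific ones.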

