CV | CL | Sep 23, 2024

Can CLIP Count Stars? An Empirical Study on Quantity Bias in CLIP

arXiv:2409.15035v1 | 28 citations | h-index: 13
Originality: Synthesis-oriented
AI Analysis

This paper addresses a specific problem for users of CLIP-based applications, but its contribution is incremental: it is an empirical study of an existing model rather than a new method.

The paper empirically investigates quantity bias in CLIP from text, image, and cross-modal perspectives. It finds a quantity bias in CLIP embeddings that undermines the reliability of downstream tasks, for example image generation that produces a different number of objects than the prompt requested.

CLIP has demonstrated great versatility in adapting to various downstream tasks, such as image editing and generation, visual question answering, and video understanding. However, CLIP-based applications often suffer from misunderstandings regarding user intent, leading to discrepancies between the required number of objects and the actual outputs in image generation tasks. In this work, we empirically investigate the quantity bias in CLIP. By carefully designing different experimental settings and datasets, we comprehensively evaluate CLIP's understanding of quantity from text, image, and cross-modal perspectives. Our experimental results reveal a quantity bias in CLIP embeddings, impacting the reliability of downstream tasks.
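The abstract does not spell out the evaluation protocol, so the following is only a minimal sketch of one way a text-side quantity probe could look, not the authors' method. It assumes that embeddings of prompts differing only in their count word are already available, and asks whether cosine similarity falls as the numeric gap between counts grows. The names `cosine`, `pearson`, and `quantity_order_score` are hypothetical, and the toy 2-D vectors stand in for real CLIP text embeddings.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def quantity_order_score(embeddings, counts):
    """Correlate the numeric gap |c_i - c_j| with embedding similarity.

    A strongly negative score means embeddings of nearby counts are more
    similar than those of distant counts, i.e. the space is count-aware.
    A score near zero would suggest the embeddings largely ignore quantity.
    """
    gaps, sims = [], []
    for i, j in combinations(range(len(counts)), 2):
        gaps.append(abs(counts[i] - counts[j]))
        sims.append(cosine(embeddings[i], embeddings[j]))
    return pearson(gaps, sims)


# Toy stand-in for embeddings of "one dog", "two dogs", ... "five dogs":
# each count is placed on an arc so that similarity shrinks with the gap.
counts = [1, 2, 3, 4, 5]
toy_embeddings = [[math.cos(0.2 * c), math.sin(0.2 * c)] for c in counts]
score = quantity_order_score(toy_embeddings, counts)
```

On these idealized toy vectors the score is strongly negative by construction. With real embeddings, the input would come from a CLIP text or image encoder, and a score close to zero would be one possible symptom of the quantity bias the abstract describes.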

