CV AI CL LG MM

Oct 13, 2025

Data or Language Supervision: What Makes CLIP Better than DINO?

Yiming Liu, Yuhui Zhang, Dhruba Ghosh, Ludwig Schmidt, Serena Yeung-Levy

Stanford

arXiv:2510.11835v119.010 citations h-index: 19 EMNLP

Originality: Synthesis-oriented

AI Analysis

The paper offers insights for researchers designing vision encoders for vision-language models, though its contribution is incremental: it clarifies how existing methods behave rather than introducing new ones.

The study investigates whether CLIP's advantage over DINO as a vision encoder for vision-language models stems from language supervision or from its larger training data. With both models pre-trained under controlled settings and reaching similar ImageNet accuracy, CLIP excels at text-intensive tasks, while DINO performs slightly better on vision-centric ones.

CLIP outperforms self-supervised models like DINO as vision encoders for vision-language models (VLMs), but it remains unclear whether this advantage stems from CLIP's language supervision or its much larger training data. To disentangle these factors, we pre-train CLIP and DINO under controlled settings -- using the same architecture, dataset, and training configuration -- achieving similar ImageNet accuracy. Embedding analysis shows that CLIP captures high-level semantics (e.g., object categories, text), while DINO is more responsive to low-level features like colors and styles. When integrated into VLMs and evaluated on 20 VQA benchmarks, CLIP excels at text-intensive tasks, while DINO slightly outperforms on vision-centric ones. Variants of language supervision (e.g., sigmoid loss, pre-trained language encoders) yield limited gains. Our findings provide scientific insights into vision encoder design and its impact on VLM performance.
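The abstract mentions a sigmoid loss as one variant of language supervision but does not spell out either objective. As a rough illustration only, the sketch below contrasts a standard softmax contrastive loss (as commonly used for CLIP) with a pairwise sigmoid loss (as commonly used for SigLIP) on a toy batch. The function names, the temperature of 0.07, and the scale and bias of 10 and -10 are assumptions chosen for illustration, not values taken from the paper.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def similarity_matrix(image_embs, text_embs):
    # Cosine similarity between every image and every text in the batch.
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    return [[dot(i, t) for t in txts] for i in imgs]

def mean_diag_cross_entropy(logits):
    # Mean of -log softmax(row)[i], treating index i as the correct match.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def softmax_contrastive_loss(sim, temperature=0.07):
    # Symmetric loss: image-to-text over rows, text-to-image over columns.
    # temperature is an assumed illustrative value.
    logits = [[s / temperature for s in row] for row in sim]
    columns = [list(c) for c in zip(*logits)]
    return 0.5 * (mean_diag_cross_entropy(logits) + mean_diag_cross_entropy(columns))

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid_loss(sim, scale=10.0, bias=-10.0):
    # Each image-text pair is an independent binary decision:
    # label +1 for matched pairs (diagonal), -1 for all others.
    # scale and bias are assumed illustrative values.
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            logit = scale * sim[i][j] + bias
            total += softplus(-label * logit)  # equals -log(sigmoid(label * logit))
    return total / n

# Toy batch: two images and two captions with made-up 2-D embeddings.
image_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[1.0, 0.0], [0.0, 1.0]]
swapped_text = [[0.0, 1.0], [1.0, 0.0]]

aligned = similarity_matrix(image_embs, aligned_text)
swapped = similarity_matrix(image_embs, swapped_text)
```

Both objectives assign a lower loss to the aligned batch than to the swapped one. The practical difference is that the softmax loss normalizes over the whole batch, whereas the sigmoid loss scores each pair independently.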
