LG · Feb 17, 2025

Pretraining Frequency Predicts Compositional Generalization of CLIP on Real-World Tasks

arXiv:2502.18326v1 · 8 citations · h-index: 36
Originality: Incremental advance
AI Analysis

This work addresses the problem of sample-inefficient scaling in CLIP models for real-world applications, offering insights for data curation to improve efficiency and accuracy without increasing data volume.

The study investigated CLIP's compositional generalization on real-world tasks by predicting performance based on pretraining frequencies of individual objects, showing that CLIP can disentangle and recompose objects from its pretraining data, with performance scaling predictably with data.

We investigate the success conditions for compositional generalization of CLIP models on real-world data through performance prediction. Prior work shows that CLIP requires exponentially more pretraining data for linear performance gains on individual concepts. This sample-inefficient scaling could be mitigated if CLIP systematically understood new inputs as compositions of learned components, allowing rare observations to be mapped to common concepts. To explore CLIP's compositional generalization ability, we filter retrieval corpora for samples with object combinations not present in the pretraining corpus. We show that CLIP's performance on these samples can be accurately predicted from the pretraining frequencies of individual objects. Our findings demonstrate that CLIP learns to disentangle objects observed in its pretraining data and can recompose them straightforwardly. Additionally, we are the first to show how this ability scales with pretraining data. For data curation in practice, our results suggest that balancing object occurrences improves generalization, which should benefit CLIP's efficiency and accuracy without scaling data volume.
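
The filtering and prediction setup described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the toy data, the pairwise definition of an "unseen combination", the log(1 + count) feature, and all function names are assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations
import math


def count_objects_and_pairs(pretraining_samples):
    """Count how often each object and each unordered object pair occurs."""
    object_counts = Counter()
    pair_counts = Counter()
    for objects in pretraining_samples:
        unique = sorted(set(objects))  # sort so each pair has one canonical order
        object_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    return object_counts, pair_counts


def has_unseen_combination(objects, pair_counts):
    """True if at least one object pair never co-occurred in pretraining.

    Assumption: "combination" is approximated by unordered pairs.
    """
    unique = sorted(set(objects))
    return any(pair_counts[pair] == 0 for pair in combinations(unique, 2))


def mean_log_frequency(objects, object_counts):
    """Hypothetical predictor: mean log(1 + count) over the sample's objects."""
    unique = set(objects)
    return sum(math.log1p(object_counts[o]) for o in unique) / len(unique)


# Toy stand-ins for object annotations of pretraining and retrieval samples.
pretraining = [
    ["dog", "ball"],
    ["dog", "sofa"],
    ["cat", "sofa"],
    ["dog"],
]
retrieval = [
    ["dog", "ball"],  # pair seen in pretraining, filtered out
    ["cat", "ball"],  # pair never seen together, kept
    ["cat", "dog"],   # pair never seen together, kept
]

object_counts, pair_counts = count_objects_and_pairs(pretraining)
unseen = [s for s in retrieval if has_unseen_combination(s, pair_counts)]
features = [mean_log_frequency(s, object_counts) for s in unseen]

for sample, feature in zip(unseen, features):
    print(sample, round(feature, 3))
```

In a real study, a feature of this kind would be regressed against measured retrieval accuracy on the filtered samples; the sketch stops at computing the feature because no performance numbers are given here.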
