CV | LG | ML | Jul 15, 2022

Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning

Stanford
arXiv:2207.07635v1 | 45 citations | h-index: 102
Originality: Incremental advance
AI Analysis

This work addresses the debate over language supervision in vision models, offering researchers controlled comparisons and practical prescriptions. It is incremental in that it builds on existing CLIP methods.

The study investigates whether language-supervised models like CLIP produce more transferable visual representations than image-only methods. It finds that CLIP outperforms when the pre-training data is large and has descriptive captions, but underperforms in other settings, and it proposes improvements to better utilize language information.

The development of CLIP [Radford et al., 2021] has sparked a debate on whether language supervision can result in vision models with more transferable representations than traditional image-only methods. Our work studies this question through a carefully controlled comparison of two approaches in terms of their ability to learn representations that generalize to downstream classification tasks. We find that when the pre-training dataset meets certain criteria -- it is sufficiently large and contains descriptive captions with low variability -- image-only methods do not match CLIP's transfer performance, even when they are trained with more image data. However, contrary to what one might expect, there are practical settings in which these criteria are not met, wherein added supervision through captions is actually detrimental. Motivated by our findings, we devise simple prescriptions to enable CLIP to better leverage the language information present in existing pre-training datasets.

