CV, CL, LG | Apr 27, 2023

DataComp: In search of the next generation of multimodal datasets

AI2, Stanford, UW
arXiv:2304.14108v5 | 696 citations | h-index: 82
Originality: Incremental advance
AI Analysis

DataComp addresses a gap in the ML ecosystem by providing a standardized benchmark for dataset curation, which lets researchers compare and improve multimodal training data. The advance is incremental because it builds on existing CLIP training methods.

The paper tackles the lack of research attention on multimodal dataset design by introducing DataComp, a testbed for experimenting with image-text datasets. Its best baseline dataset, DataComp-1B, trains a CLIP ViT-L/14 model to 79.2% zero-shot accuracy on ImageNet, outperforming OpenAI's CLIP ViT-L/14 by 3.7 percentage points.

Multimodal datasets are a critical component in recent breakthroughs such as Stable Diffusion and GPT-4, yet their design does not receive the same research attention as model architectures or training algorithms. To address this shortcoming in the ML ecosystem, we introduce DataComp, a testbed for dataset experiments centered around a new candidate pool of 12.8 billion image-text pairs from Common Crawl. Participants in our benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running our standardized CLIP training code and testing the resulting model on 38 downstream test sets. Our benchmark consists of multiple compute scales spanning four orders of magnitude, which enables the study of scaling trends and makes the benchmark accessible to researchers with varying resources. Our baseline experiments show that the DataComp workflow leads to better training sets. In particular, our best baseline, DataComp-1B, enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet, outperforming OpenAI's CLIP ViT-L/14 by 3.7 percentage points while using the same training procedure and compute. We release DataComp and all accompanying code at www.datacomp.ai.
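The participant-side workflow described in the abstract (select a subset of the candidate pool, train with fixed code, evaluate on downstream test sets) can be illustrated with a minimal sketch. Everything here is hypothetical and is not the DataComp codebase or its API: the `Pair` record, the `quality` field, the thresholds, and the unweighted mean used to aggregate per-task scores are all assumptions chosen for illustration. The training step is omitted because the benchmark fixes it.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Pair:
    """One image-text candidate. All field names are hypothetical."""
    url: str
    caption: str
    quality: float  # hypothetical per-sample score from some scoring model


def filter_pool(pool: Iterable[Pair], keep: Callable[[Pair], bool]) -> list[Pair]:
    """Participant's step: select a training subset from the candidate pool."""
    return [p for p in pool if keep(p)]


def average_score(per_task: dict[str, float]) -> float:
    """Aggregate per-task accuracies (unweighted mean; an assumption)."""
    if not per_task:
        raise ValueError("no evaluation results")
    return sum(per_task.values()) / len(per_task)


# Toy candidate pool standing in for the real multi-billion-pair pool.
pool = [
    Pair("http://example.com/a.jpg", "a dog on a beach", 0.31),
    Pair("http://example.com/b.jpg", "IMG_0042", 0.05),
    Pair("http://example.com/c.jpg", "red bicycle against a wall", 0.27),
]

# Hypothetical filter: keep pairs with a high score and a multi-word caption.
subset = filter_pool(
    pool,
    lambda p: p.quality >= 0.2 and len(p.caption.split()) >= 2,
)
```

In this framing only the `keep` predicate changes between submissions, which is what makes results comparable: the model, training code, and evaluation suite stay fixed while the dataset varies.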
