CV · LG · Nov 25, 2025

Concept-Aware Batch Sampling Improves Language-Image Pretraining

arXiv:2511.20643v1 · 1 citation · Has Code
Originality: Incremental advance
AI Analysis

This work addresses data curation for vision-language model training and offers an open-source alternative to proprietary methods. It is incremental in that it builds on existing batch sampling and concept annotation approaches.

The paper tackles the problem of data curation for vision-language models by proposing Concept-Aware Batch Sampling (CABS), a flexible online method that constructs batches based on target concept distributions, and demonstrates significant benefits across 28 benchmarks for CLIP/SigLIP models.

What data should a vision-language model be trained on? To answer this question, many data curation efforts center on the quality of a dataset. However, most of these existing methods are (i) offline, i.e. they produce a static dataset from a set of predetermined filtering criteria, and (ii) concept-agnostic, i.e. they use model-based filters which induce additional data biases. In this work, we go beyond such offline, concept-agnostic methods and advocate for more flexible, task-adaptive online concept-based curation. Our first contribution is DataConcept, a collection of 128M web-crawled image-text pairs annotated with fine-grained details about their concept composition. Building on DataConcept, we introduce Concept-Aware Batch Sampling (CABS), a simple yet effective batch sampling framework that flexibly constructs batches on-the-fly based on specific target distributions. We propose two variants: (i) Diversity Maximization (CABS-DM) to curate batches with a broad coverage of available concepts, and (ii) Frequency Maximization (CABS-FM) to curate batches with high object multiplicity. Through extensive evaluations across 28 benchmarks, we demonstrate that our CABS method significantly benefits CLIP/SigLIP model classes and yields highly performant models. Overall, CABS represents a strong open-source alternative to proprietary online data curation algorithms, enabling practitioners to define custom concept distributions that optimize for specific downstream tasks.
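The abstract describes the two variants only at a high level, so the following is a minimal illustrative sketch and not the authors' implementation. It assumes each candidate in an oversampled pool is represented by a list of concept labels (as DataConcept-style annotations might provide), that "broad coverage" can be approximated by a greedy pick favoring under-represented concepts, and that "high object multiplicity" can be approximated by the number of annotated objects per sample. The function names `select_diverse` and `select_frequent` and the scoring rules are hypothetical.

```python
from collections import Counter


def select_diverse(candidates, batch_size):
    """Greedy stand-in for a diversity-maximizing sub-batch selection.

    candidates: list of lists of concept labels, one list per sample.
    Returns indices of the chosen samples. Each step picks the sample
    whose concepts are least represented in the batch built so far.
    """
    counts = Counter()                      # concept -> times already selected
    remaining = list(range(len(candidates)))
    chosen = []
    while remaining and len(chosen) < batch_size:
        # Assumed score: rarer concepts contribute more (1 / (1 + count)).
        best = max(
            remaining,
            key=lambda i: sum(1.0 / (1 + counts[c]) for c in set(candidates[i])),
        )
        chosen.append(best)
        counts.update(set(candidates[best]))
        remaining.remove(best)
    return chosen


def select_frequent(candidates, batch_size):
    """Stand-in for a frequency-maximizing selection.

    Assumption: multiplicity is the number of annotated objects per sample,
    so we keep the samples with the most annotations.
    """
    order = sorted(range(len(candidates)),
                   key=lambda i: len(candidates[i]), reverse=True)
    return order[:batch_size]


# Toy pool of five samples with hypothetical concept annotations.
pool = [["dog"], ["dog"], ["cat"], ["dog", "cat", "car"], ["tree"]]
diverse_idx = select_diverse(pool, 3)
frequent_idx = select_frequent(pool, 2)
```

In this toy pool, the diversity-oriented selection covers four distinct concepts with three samples, while the frequency-oriented selection simply favors the most heavily annotated sample. In an online setting, either function would be applied to each freshly drawn candidate pool before the training step.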

