LG, ML | Dec 16, 2025

Understanding the Gain from Data Filtering in Multimodal Contrastive Learning

Divyansh Pareek, Sewoong Oh, Simon S. Du

arXiv:2512.14230v1 | 4.1 | h-index: 5

Originality: Incremental advance

AI Analysis

This paper provides a theoretical justification for teacher-based data filtering, a common practice in multimodal learning, and addresses data curation challenges faced by researchers and practitioners. The advance is incremental because it builds on existing models.

The paper addresses data quality in multimodal contrastive learning by analyzing teacher-based filtering. It shows that the error without filtering scales as 1/(η√n), while the error with filtering is bounded by 1/√(ηn) in the large η regime and by 1/√n in the small η regime, where η is the fraction of correctly matched pairs among n paired samples.

The success of modern multimodal representation learning relies on internet-scale datasets. Due to the low quality of a large fraction of raw web data, data curation has become a critical step in the training pipeline. Filtering using a trained model (i.e., teacher-based filtering) has emerged as a successful solution, leveraging a pre-trained model to compute quality scores. To explain the empirical success of teacher-based filtering, we characterize the performance of filtered contrastive learning under the standard bimodal data generation model. Denoting $η\in(0,1]$ as the fraction of data with correctly matched modalities among $n$ paired samples, we utilize a linear contrastive learning setup to show a provable benefit of data filtering: $(i)$ the error without filtering is upper and lower bounded by $\frac{1}{η\sqrt{n}}$, and $(ii)$ the error with teacher-based filtering is upper bounded by $\frac{1}{\sqrt{ηn}}$ in the large $η$ regime, and by $\frac{1}{\sqrt{n}}$ in the small $η$ regime.
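To make the stated rates concrete, the following minimal sketch evaluates the three expressions from the abstract for a few values of $η$. It assumes all constants and logarithmic factors are dropped, so the numbers indicate scaling only and are not actual error values. The function names are hypothetical and introduced here for illustration; the boundary between the large and small $η$ regimes is not given in the abstract, so both filtered rates are printed for every $η$.

```python
import math


def _check_inputs(eta: float, n: int) -> None:
    # eta is the fraction of correctly matched pairs, n the number of paired samples.
    if not (0.0 < eta <= 1.0):
        raise ValueError("eta must lie in (0, 1]")
    if n <= 0:
        raise ValueError("n must be positive")


def unfiltered_rate(eta: float, n: int) -> float:
    """Rate 1 / (eta * sqrt(n)) stated for learning without filtering."""
    _check_inputs(eta, n)
    return 1.0 / (eta * math.sqrt(n))


def filtered_rate_large_eta(eta: float, n: int) -> float:
    """Rate 1 / sqrt(eta * n) stated for teacher-based filtering, large eta."""
    _check_inputs(eta, n)
    return 1.0 / math.sqrt(eta * n)


def filtered_rate_small_eta(n: int) -> float:
    """Rate 1 / sqrt(n) stated for teacher-based filtering, small eta."""
    _check_inputs(1.0, n)
    return 1.0 / math.sqrt(n)


# Illustrative values only; constants are ignored throughout.
n = 1_000_000
for eta in (0.9, 0.5, 0.1, 0.01):
    base = unfiltered_rate(eta, n)
    large = filtered_rate_large_eta(eta, n)
    small = filtered_rate_small_eta(n)
    print(
        f"eta={eta:<5} unfiltered={base:.2e} "
        f"filtered(large eta)={large:.2e} filtered(small eta)={small:.2e}"
    )
```

Under these assumptions, the ratio of the unfiltered rate to the filtered rate is $1/\sqrt{η}$ when compared against $\frac{1}{\sqrt{ηn}}$ and $1/η$ when compared against $\frac{1}{\sqrt{n}}$, so the gain grows as the fraction of correctly matched pairs shrinks.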
