LG CV Sep 24, 2023

Devil in the Number: Towards Robust Multi-modality Data Filter

arXiv:2309.13770v1 | 2 citations | h-index: 27
Originality: Incremental advance
AI Analysis

This work addresses data filtering for web-scale multi-modality datasets to reduce training costs, but it is incremental as it builds on existing CLIP-based methods.

The paper tackles the problem of redundant numerical information in multi-modality data filtering, which negatively impacts CLIP scores, and shows that a text-based CLIP filter improves performance by 3.6% on a benchmark compared to the top-ranked method.

To filter web-scale multi-modality datasets appropriately, it is crucial to employ suitable filtering methods that boost performance and reduce training costs. For instance, the LAION papers employ a CLIP score filter to select data whose CLIP scores surpass a certain threshold. T-MARS, on the other hand, achieves high-quality data filtering by detecting and masking text within images and then filtering by CLIP score. Analyzing the dataset, we observe a significant proportion of redundant information, such as numbers, in the textual content. Our experiments on a subset of the data reveal the profound impact of these redundant elements on CLIP scores. A logical approach is therefore to re-evaluate the CLIP scores after eliminating these influences. Experimentally, our text-based CLIP filter outperforms the top-ranked method on the "small scale" of DataComp (a data filtering benchmark) on ImageNet distribution shifts, achieving a 3.6% performance improvement. The results also demonstrate that our proposed text-masked filter outperforms the original CLIP score filter when selecting the top 40% of the data. The impact of numbers on CLIP and their handling provide valuable insights for improving the effectiveness of CLIP training, including language rewrite techniques.
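The idea in the abstract (strip numbers from the caption, re-score, keep the top-scoring share of the data) can be illustrated with a minimal sketch. This is not the authors' implementation: the regular expression, the function names `mask_numbers` and `filter_top_fraction`, and the `score_fn` callable are all assumptions made for illustration. In practice `score_fn` would wrap a CLIP model's image-text similarity; here it is left as a parameter so the sketch runs without model weights.

```python
import re
from typing import Any, Callable, List, Sequence, Tuple

# Assumed masking rule: drop runs of digits, including forms like "10.5" or "1,000".
# The paper's exact masking procedure is not given in the abstract.
_NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")


def mask_numbers(caption: str) -> str:
    """Remove numeric tokens from a caption and collapse leftover whitespace."""
    stripped = _NUMBER_RE.sub(" ", caption)
    return " ".join(stripped.split())


def filter_top_fraction(
    samples: Sequence[Tuple[Any, str]],
    score_fn: Callable[[Any, str], float],
    keep_fraction: float = 0.4,
) -> List[int]:
    """Return indices of the samples kept after scoring with numbers masked.

    `score_fn(image, text)` is a hypothetical stand-in for a CLIP
    image-text similarity score computed on the masked caption.
    """
    scored = [
        (score_fn(image, mask_numbers(caption)), index)
        for index, (image, caption) in enumerate(samples)
    ]
    # Highest score first; keep the requested share of the data.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    n_keep = int(len(scored) * keep_fraction)
    return sorted(index for _, index in scored[:n_keep])
```

With `keep_fraction=0.4` this mirrors the "top 40%" setting mentioned above. A threshold-based variant, as described for the LAION filter, would compare each score to a fixed cutoff instead of ranking.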

