LG | AI | CV | May 24, 2024

Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach

arXiv:2405.15613v2 | 56 citations | h-index: 71 | Has Code | Trans. Mach. Learn. Res.
Originality: Incremental advance
AI Analysis

The paper addresses the costly, time-consuming bottleneck of manual data curation for self-supervised learning, which enables more scalable dataset construction.

The paper replaces manual data curation for self-supervised learning with an automatic clustering-based method that builds large, diverse, and balanced datasets. Features trained on these datasets outperform features trained on uncurated data and match or exceed features trained on manually curated data across three domains: web images, satellite images, and text.

Self-supervised features are the cornerstone of modern machine learning systems. They are typically pre-trained on data collections whose construction and curation typically require extensive human effort. This manual process has some limitations similar to those encountered in supervised learning, e.g., the crowd-sourced selection of data is costly and time-consuming, preventing scaling the dataset size. In this work, we consider the problem of automatic curation of high-quality datasets for self-supervised pre-training. We posit that such datasets should be large, diverse and balanced, and propose a clustering-based approach for building ones satisfying all these criteria. Our method involves successive and hierarchical applications of $k$-means on a large and diverse data repository to obtain clusters that distribute uniformly among data concepts, followed by a hierarchical, balanced sampling step from these clusters. Extensive experiments on three different data domains including web-based images, satellite images and text show that features trained on our automatically curated datasets outperform those trained on uncurated data while being on par or better than ones trained on manually curated data. Code is available at https://github.com/facebookresearch/ssl-data-curation.
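The two stages described in the abstract, hierarchical k-means followed by hierarchical balanced sampling, can be illustrated with a small sketch. This is not the code from the linked repository: it assumes only two levels, assumes the second level is obtained by clustering the first-level centroids, and splits the sampling budget evenly at each level. The function names (`kmeans`, `curate`) and parameters (`k1`, `k2`, `n`) are hypothetical and chosen only for this example.

```python
import numpy as np


def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means on rows of x; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Fancy indexing copies, so updating centroids does not modify x.
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Squared distance from every point to every centroid.
        dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dist.argmin(1)
        for j in range(k):
            members = x[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(0)
    # Recompute labels so they agree with the final centroids.
    dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return centroids, dist.argmin(1)


def curate(x, k1, k2, n, seed=0):
    """Return indices of roughly n points of x, balanced across clusters.

    Assumption for this sketch: level 1 clusters the data into k1 groups,
    level 2 clusters the k1 centroids into k2 groups.
    """
    rng = np.random.default_rng(seed)
    centroids1, labels1 = kmeans(x, k1, seed=seed)
    _, labels2 = kmeans(centroids1, k2, seed=seed)

    picked = []
    top_clusters = np.unique(labels2)
    for top in top_clusters:
        # Level-1 clusters that belong to this level-2 cluster.
        children = np.flatnonzero(labels2 == top)
        # Even split: first across level-2 clusters, then across children.
        per_child = max(1, n // (len(top_clusters) * len(children)))
        for child in children:
            idx = np.flatnonzero(labels1 == child)
            take = min(per_child, len(idx))
            if take:
                picked.extend(rng.choice(idx, size=take, replace=False))
    return np.array(picked, dtype=int)


# Toy usage: one dense concept (950 points) and one rare concept (50 points).
demo_rng = np.random.default_rng(1)
demo_x = np.concatenate([
    demo_rng.normal(0.0, 1.0, size=(950, 2)),
    demo_rng.normal(8.0, 1.0, size=(50, 2)),
])
demo_idx = curate(demo_x, k1=10, k2=2, n=100)
```

Because the budget is divided per cluster rather than per point, a rare concept that ends up in its own cluster receives the same share as a dense one, which is the intuition behind the "balanced" criterion. The even split and the small toy data are simplifications for illustration only.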

Code Implementations: 1 repo
