LG · AI · CV · Dec 17, 2023

A Weighted K-Center Algorithm for Data Subset Selection

arXiv:2312.10602v1 · 11 citations · h-index: 38
Originality: Incremental advance
AI Analysis

This paper addresses the need for efficient data subset selection to reduce annotation and computation costs in deep learning. It is an incremental improvement over prior methods.

The paper tackles the problem of selecting informative and diverse subsets of training data for deep learning. It develops a weighted k-center algorithm that combines the uncertainty sampling and clustering (k-center) objectives, and it achieves similar or better performance than other strong baselines on vision datasets such as CIFAR-10, CIFAR-100, and ImageNet.

The success of deep learning hinges on enormous data and large models, which require labor-intensive annotations and heavy computation costs. Subset selection is a fundamental problem that can play a key role in identifying smaller portions of the training data, which can then be used to produce similar models as the ones trained with full data. Two prior methods are shown to achieve impressive results: (1) margin sampling that focuses on selecting points with high uncertainty, and (2) core-sets or clustering methods such as k-center for informative and diverse subsets. We are not aware of any work that combines these methods in a principled manner. To this end, we develop a novel and efficient factor 3-approximation algorithm to compute subsets based on the weighted sum of both k-center and uncertainty sampling objective functions. To handle large datasets, we show a parallel algorithm to run on multiple machines with approximation guarantees. The proposed algorithm achieves similar or better performance compared to other strong baselines on vision datasets such as CIFAR-10, CIFAR-100, and ImageNet.
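To make the idea of a weighted sum of the two objectives concrete, here is a minimal sketch in Python. It is not the paper's factor 3-approximation algorithm and carries no approximation guarantee. It is a simple greedy farthest-first heuristic that adds a margin-based uncertainty term to the usual k-center distance score. The function names, the `lam` trade-off parameter, the `first` seed index, and the choice of Euclidean distance are all assumptions made for illustration.

```python
import numpy as np


def margin_uncertainty(probs):
    """Margin-based uncertainty: 1 - (top-1 prob - top-2 prob), per row.

    A small margin between the two most likely classes gives a value near 1.
    """
    ordered = np.sort(probs, axis=1)
    return 1.0 - (ordered[:, -1] - ordered[:, -2])


def greedy_weighted_k_center(features, uncertainty, k, lam=1.0, first=0):
    """Illustrative greedy heuristic (hypothetical, not the paper's method).

    Each step picks the point that maximizes
        (distance to nearest selected point) + lam * uncertainty.
    With lam = 0 this reduces to the classic farthest-first k-center greedy.
    """
    features = np.asarray(features, dtype=float)
    uncertainty = np.asarray(uncertainty, dtype=float)
    selected = [first]
    # Distance from every point to its nearest selected center so far.
    min_dist = np.linalg.norm(features - features[first], axis=1)
    for _ in range(k - 1):
        score = min_dist + lam * uncertainty
        score[selected] = -np.inf  # never pick the same point twice
        nxt = int(np.argmax(score))
        selected.append(nxt)
        dist_to_new = np.linalg.norm(features - features[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return selected
```

In this sketch `lam` controls the balance: a larger value favors uncertain points, a smaller value favors coverage of the feature space. In practice `features` would typically be embeddings from a trained model and `probs` its softmax outputs, though the source does not specify these details.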
