Diverse Subset Selection via Norm-Based Sampling and Orthogonality
This work addresses the high cost of data labeling in domains such as medical imaging by providing a method for selecting informative subsets. The contribution is incremental, since it builds on existing subset selection techniques.
To reduce annotation costs, the paper proposes a subset selection method that combines feature norms, randomization, and orthogonality to select diverse and informative samples from large unlabeled datasets. It reports consistent performance improvements across image and text benchmarks, including CIFAR-10/100, Tiny ImageNet, ImageNet, OrganAMNIST, and Yelp.
Large annotated datasets are crucial for the success of deep neural networks, but labeling data can be prohibitively expensive in domains such as medical imaging. This work tackles the subset selection problem: selecting a small set of the most informative examples from a large unlabeled pool for annotation. We propose a simple and effective method that combines feature norms, randomization, and orthogonality (via the Gram-Schmidt process) to select diverse and informative samples. Feature norms serve as a proxy for informativeness, while randomization and orthogonalization reduce redundancy and encourage coverage of the feature space. Extensive experiments on image and text benchmarks, including CIFAR-10/100, Tiny ImageNet, ImageNet, OrganAMNIST, and Yelp, show that our method consistently improves subset selection performance, both as a standalone approach and when integrated with existing techniques.
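One way the three ingredients named in the abstract could fit together is sketched below. This is an illustrative reading, not the authors' algorithm: the abstract does not specify the sampling distribution, the order of operations, or how features are extracted. The sketch assumes a precomputed feature matrix, samples each point with probability proportional to the norm of its current residual, and then applies a Gram-Schmidt step that removes the chosen direction from all remaining features. The function name `select_subset` and its parameters are hypothetical.

```python
import numpy as np


def select_subset(features, k, seed=0):
    """Pick up to k row indices using norm-weighted sampling plus Gram-Schmidt.

    Assumed scheme (not taken from the paper): sample one point with
    probability proportional to its residual norm, then project the chosen
    direction out of every residual so that similar points become unlikely.
    """
    rng = np.random.default_rng(seed)
    residual = np.asarray(features, dtype=float).copy()
    n = residual.shape[0]
    chosen = np.zeros(n, dtype=bool)
    selected = []

    for _ in range(min(k, n)):
        # Norm of what is left of each feature after earlier projections.
        norms = np.linalg.norm(residual, axis=1)
        norms[chosen] = 0.0  # never pick the same point twice
        total = norms.sum()

        if total <= 1e-12:
            # The selected directions already span the data: fall back to
            # uniform sampling over the points not yet chosen.
            idx = int(rng.choice(np.flatnonzero(~chosen)))
        else:
            # Randomized choice, weighted by residual norm.
            idx = int(rng.choice(n, p=norms / total))

        selected.append(idx)
        chosen[idx] = True

        # Gram-Schmidt step: remove the chosen direction from all residuals.
        v = residual[idx]
        v_norm = np.linalg.norm(v)
        if v_norm > 1e-12:
            q = v / v_norm
            residual -= np.outer(residual @ q, q)

    return selected


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    feats = rng.normal(size=(200, 16))  # stand-in for encoder features
    print(select_subset(feats, k=10))
```

In this sketch, a point that nearly duplicates an already selected one has a residual norm close to zero after the projection step, so it is rarely sampled again. That is one way to realize the redundancy reduction the abstract attributes to orthogonalization, while the norm weighting keeps the selection biased toward high-norm features.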