LG | Nov 6, 2024

A Bayesian Approach to Data Point Selection

Xinnuo Xu, Minyoung Kim, Royson Lee, Brais Martinez, Timothy Hospedales

arXiv:2411.03768v1 | 7.93 citations | h-index: 11 | NIPS

Originality: Incremental advance

AI Analysis

This paper addresses the problem of efficient data selection for deep learning practitioners, though it appears incremental because it builds on existing Bayesian and sampling techniques.

The authors tackle the computational and theoretical limitations of existing bi-level optimization approaches to data point selection by proposing a Bayesian method that treats selection as posterior inference. The resulting update is comparable in cost to SGD, and the method scales to large language models.

Data point selection (DPS) is becoming a critical topic in deep learning due to the ease of acquiring uncurated training data compared to the difficulty of obtaining curated or processed data. Existing approaches to DPS are predominantly based on a bi-level optimisation (BLO) formulation, which is demanding in terms of memory and computation, and exhibits some theoretical defects regarding minibatches. Thus, we propose a novel Bayesian approach to DPS. We view the DPS problem as posterior inference in a novel Bayesian model where the posterior distributions of the instance-wise weights and the main neural network parameters are inferred under a reasonable prior and likelihood model. We employ stochastic gradient Langevin MCMC sampling to learn the main network and instance-wise weights jointly, ensuring convergence even with minibatches. Our update equation is comparable to the widely used SGD and much more efficient than existing BLO-based methods. Through controlled experiments in both the vision and language domains, we present a proof of concept. Additionally, we demonstrate that our method scales effectively to large language models and facilitates automated per-task optimization for instruction fine-tuning datasets.
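
To make the sampling step in the abstract concrete, here is a minimal sketch of a generic stochastic gradient Langevin dynamics (SGLD) update applied jointly to two parameter blocks. It is not the paper's model: the block names `theta` and `weight_logits`, the helper `toy_grad_log_post`, and the Gaussian toy posterior are all assumptions made for illustration, since the abstract does not specify the prior or likelihood. In a real setting the gradients would be minibatch estimates of the log-posterior gradient.

```python
import numpy as np


def sgld_step(params, grads, step_size, rng, temperature=1.0):
    """One SGLD update applied to every parameter block.

    params, grads: dicts mapping a block name to an array. grads holds an
    estimate of the gradient of the log posterior for each block.
    """
    new_params = {}
    for name, value in params.items():
        noise = rng.standard_normal(value.shape)
        new_params[name] = (
            value
            + 0.5 * step_size * grads[name]                  # drift toward high posterior density
            + np.sqrt(step_size * temperature) * noise       # injected Gaussian noise
        )
    return new_params


def toy_grad_log_post(params, targets):
    # Hypothetical stand-in posterior: independent unit-variance Gaussians
    # centred at `targets`, whose log-density gradient is (target - value).
    return {name: targets[name] - value for name, value in params.items()}


def run_chain(n_steps=20000, step_size=0.1, seed=0):
    rng = np.random.default_rng(seed)
    # Placeholder blocks: "theta" plays the role of network parameters,
    # "weight_logits" the role of instance-wise weights (names are assumptions).
    targets = {
        "theta": np.array([2.0, -1.0]),
        "weight_logits": np.array([0.5, 0.5, -3.0]),
    }
    params = {name: np.zeros_like(t) for name, t in targets.items()}
    sums = {name: np.zeros_like(t) for name, t in targets.items()}
    for _ in range(n_steps):
        grads = toy_grad_log_post(params, targets)
        params = sgld_step(params, grads, step_size, rng)   # both blocks move together
        for name in sums:
            sums[name] += params[name]
    means = {name: s / n_steps for name, s in sums.items()}
    return means, targets


if __name__ == "__main__":
    means, targets = run_chain()
    for name in targets:
        print(name, "sample mean:", np.round(means[name], 2), "target:", targets[name])
```

Setting `temperature=0.0` removes the noise term and leaves a plain gradient step of size `step_size / 2`, which illustrates why such an update can be close in cost to SGD: the only extra work per step is drawing Gaussian noise.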
