LG, DS, Feb 27, 2024

Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond

arXiv:2402.17327v1, 20 citations, h-index: 61, ICML
Originality: Incremental advance
AI Analysis

This work addresses the problem of reducing the amount of data needed to train machine learning models, which matters for resource-constrained applications. The contribution appears incremental because it builds on existing sampling and clustering techniques.

The paper tackles the data selection problem for efficient model training by introducing a clustering-based sensitivity sampling method that selects a small subset of the data. It comes with a theoretical guarantee: the average loss on the subset approximates the average loss on the entire dataset within multiplicative and additive error bounds. Experiments on fine-tuning foundation models show good performance and scalability, with the method outperforming state-of-the-art baselines. On linear regression, it matches the performance of leverage score sampling.

We study the data selection problem, whose aim is to select a small representative subset of data that can be used to efficiently train a machine learning model. We present a new data selection approach based on $k$-means clustering and sensitivity sampling. Assuming access to an embedding representation of the data with respect to which the model loss is Hölder continuous, our approach provably allows selecting a set of ``typical'' $k + 1/\varepsilon^2$ elements whose average loss corresponds to the average loss of the whole dataset, up to a multiplicative $(1\pm\varepsilon)$ factor and an additive $\varepsilon \lambda \Phi_k$, where $\Phi_k$ represents the $k$-means cost for the input embeddings and $\lambda$ is the Hölder constant. We furthermore demonstrate the performance and scalability of our approach on fine-tuning foundation models and show that it outperforms state-of-the-art methods. We also show how it can be applied to linear regression, leading to a new sampling strategy that surprisingly matches the performance of leverage score sampling, while being conceptually simpler and more scalable.
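To make the idea of combining $k$-means clustering with sensitivity sampling concrete, here is a minimal Python sketch. It is not the paper's algorithm: the abstract does not state the exact sampling distribution, so the sketch assumes a generic sensitivity-style rule that mixes a uniform term with each point's share of the $k$-means cost. The function names (`kmeans`, `sensitivity_sample`), the 50/50 mixture, and the sample size `m` are hypothetical choices made for illustration. The sketch also samples `m` points only and does not reproduce the $k + 1/\varepsilon^2$ budget from the abstract.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))


def kmeans(points, k, iters=20, rng=None):
    """Plain Lloyd's algorithm; returns (centers, assignment)."""
    rng = rng or random.Random(0)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        assign = [nearest(p, centers) for p in points]
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return centers, [nearest(p, centers) for p in points]


def sensitivity_sample(points, k, m, rng=None):
    """Draw m indices (with replacement) and importance weights.

    Assumed sampling rule (illustrative, not taken from the paper):
    p_i = 1/2 * uniform + 1/2 * (cost of point i / total k-means cost).
    With weights 1 / (m * p_i), the weighted sum of any per-point loss
    over the sample is an unbiased estimate of its sum over all points.
    """
    rng = rng or random.Random(0)
    n = len(points)
    centers, assign = kmeans(points, k, rng=rng)
    costs = [sq_dist(p, centers[a]) for p, a in zip(points, assign)]
    total = sum(costs)
    probs = [0.5 / n + (0.5 * c / total if total > 0 else 0.5 / n)
             for c in costs]
    idx = rng.choices(range(n), weights=probs, k=m)
    weights = [1.0 / (m * probs[i]) for i in idx]
    return idx, weights


if __name__ == "__main__":
    gen = random.Random(1)
    # Toy "embeddings": two well-separated Gaussian blobs in 2D.
    data = [(gen.gauss(0, 1), gen.gauss(0, 1)) for _ in range(200)]
    data += [(gen.gauss(8, 1), gen.gauss(8, 1)) for _ in range(200)]
    idx, w = sensitivity_sample(data, k=2, m=40)
    # Weighted estimate of the average of a toy "loss" (squared norm).
    estimate = sum(wi * sq_dist(data[i], (0, 0)) for i, wi in zip(idx, w)) / len(data)
    exact = sum(sq_dist(p, (0, 0)) for p in data) / len(data)
    print(f"estimated average loss {estimate:.2f}, exact {exact:.2f}")
```

The uniform half of the mixture keeps every sampling probability at least $1/(2n)$, so no importance weight can exceed $2n/m$. Points far from their cluster center are sampled more often, which is the intuition behind weighting by the $k$-means cost.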

