LG DS Feb 27, 2024

Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond

Kyriakos Axiotis, Vincent Cohen-Addad, Monika Henzinger, Sammy Jerome, Vahab Mirrokni, David Saulpic, David Woodruff, Michael Wunder

arXiv:2402.17327v1 16.420 citations h-index: 61 ICML

Originality: Incremental advance

AI Analysis

This work addresses the problem of reducing data requirements for training machine learning models, which is crucial for resource-constrained applications, though it appears incremental as it builds on existing sampling and clustering techniques.

The paper tackles the data selection problem for efficient model training by introducing a clustering-based sensitivity sampling method that selects a small subset of the data. The method comes with a theoretical guarantee: the average loss on the selected subset approximates the average loss on the entire dataset within multiplicative and additive error bounds. Experiments on fine-tuning foundation models show improved performance and scalability over state-of-the-art methods, and on linear regression the resulting sampling strategy matches the performance of leverage score sampling.

We study the data selection problem, whose aim is to select a small representative subset of data that can be used to efficiently train a machine learning model. We present a new data selection approach based on $k$-means clustering and sensitivity sampling. Assuming access to an embedding representation of the data with respect to which the model loss is Hölder continuous, our approach provably allows selecting a set of "typical" $k + 1/\varepsilon^2$ elements whose average loss corresponds to the average loss of the whole dataset, up to a multiplicative $(1\pm\varepsilon)$ factor and an additive $\varepsilon \lambda \Phi_k$, where $\Phi_k$ represents the $k$-means cost for the input embeddings and $\lambda$ is the Hölder constant. We furthermore demonstrate the performance and scalability of our approach on fine-tuning foundation models and show that it outperforms state-of-the-art methods. We also show how it can be applied to linear regression, leading to a new sampling strategy that surprisingly matches the performance of leverage score sampling, while being conceptually simpler and more scalable.
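The pipeline described in the abstract (cluster the embeddings, then sample points by a sensitivity-style score) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The scoring rule used here (loss at the cluster center plus $\lambda$ times the squared distance to that center), the inverse-probability weights, and the names `kmeans` and `sensitivity_sample` are assumptions chosen for illustration, and the toy loss is hypothetical.

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm on the embeddings; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distance from every point to every center, shape (n, k).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d2.argmin(axis=1)

def sensitivity_sample(X, center_losses, centers, labels, lam, m, seed=0):
    """Sample m indices with probability proportional to an assumed score:
    loss at the assigned center + lam * squared distance to that center.
    Returns (indices, weights); weights are inverse-probability weights,
    so sum(weights * loss[indices]) is an unbiased estimate of the total loss.
    """
    rng = np.random.default_rng(seed)
    dist2 = ((X - centers[labels]) ** 2).sum(axis=1)
    scores = center_losses[labels] + lam * dist2
    total = scores.sum()
    # Fall back to uniform sampling if every score is zero.
    probs = scores / total if total > 0 else np.full(len(X), 1.0 / len(X))
    idx = rng.choice(len(X), size=m, replace=True, p=probs)
    weights = 1.0 / (m * probs[idx])
    return idx, weights

# Toy usage with a hypothetical loss: Euclidean norm of the embedding.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
loss = lambda Z: np.linalg.norm(Z, axis=1)

centers, labels = kmeans(X, k=8)
# Only k loss evaluations (at the centers) are needed to build the scores.
idx, w = sensitivity_sample(X, loss(centers), centers, labels, lam=1.0, m=50)

estimate = (w * loss(X[idx])).sum() / len(X)  # weighted estimate of the average loss
exact = loss(X).mean()
print(f"estimated average loss: {estimate:.3f}, exact: {exact:.3f}")
```

In this sketch the loss is evaluated only at the $k$ centers before sampling, which is why the approach is attractive when each loss evaluation is expensive. The sum of squared distances computed inside `sensitivity_sample` is the $k$-means cost $\Phi_k$ of the clustering.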
