Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond
This work addresses the problem of reducing the amount of data needed to train machine learning models, which is crucial for resource-constrained applications. The contribution appears incremental, however, since it builds on existing sampling and clustering techniques.
The paper tackles the data selection problem for efficient model training by introducing a clustering-based sensitivity sampling method. The method selects a small subset of the data and comes with a theoretical guarantee: the average loss on the subset approximates the average loss on the entire dataset within multiplicative and additive error bounds. The paper demonstrates the method's performance and scalability on fine-tuning foundation models, where it outperforms state-of-the-art methods, and applies it to linear regression, where it matches leverage score sampling.
We study the data selection problem, whose aim is to select a small representative subset of data that can be used to efficiently train a machine learning model. We present a new data selection approach based on $k$-means clustering and sensitivity sampling. Assuming access to an embedding representation of the data with respect to which the model loss is Hölder continuous, our approach provably allows selecting a set of ``typical'' $k + 1/\varepsilon^2$ elements whose average loss corresponds to the average loss of the whole dataset, up to a multiplicative $(1\pm\varepsilon)$ factor and an additive $\varepsilon \lambda \Phi_k$, where $\Phi_k$ represents the $k$-means cost for the input embeddings and $\lambda$ is the Hölder constant. We furthermore demonstrate the performance and scalability of our approach on fine-tuning foundation models and show that it outperforms state-of-the-art methods. We also show how it can be applied to linear regression, leading to a new sampling strategy that surprisingly matches the performance of leverage score sampling, while being conceptually simpler and more scalable.
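The abstract does not spell out the sampling distribution, so the following is only a minimal sketch of generic $k$-means sensitivity sampling in the style of the coreset literature, not the paper's exact procedure. The function names, the plain Lloyd iteration, and the sensitivity proxy (a point's share of the clustering cost plus the inverse of its cluster size) are all assumptions made for illustration. Each sampled point receives an importance weight so that the weighted sum of losses over the sample is an unbiased estimate of the sum over the full dataset.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))


def kmeans(points, k, iters=20, rng=None):
    """Plain Lloyd iteration (illustrative stand-in for any k-means solver)."""
    rng = rng or random.Random(0)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        assign = [nearest(p, centers) for p in points]
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # keep the old center if a cluster empties out
                dim = len(members[0])
                centers[j] = [sum(m[d] for m in members) / len(members)
                              for d in range(dim)]
    # Final assignment, consistent with the returned centers.
    assign = [nearest(p, centers) for p in points]
    return centers, assign


def sampling_probabilities(points, centers, assign):
    """Assumed sensitivity proxy: cost share + 1 / cluster size, normalized."""
    costs = [sq_dist(p, centers[a]) for p, a in zip(points, assign)]
    total_cost = sum(costs)  # the k-means cost of this clustering
    sizes = [assign.count(j) for j in range(len(centers))]
    scores = [(c / total_cost if total_cost > 0 else 0.0) + 1.0 / sizes[a]
              for c, a in zip(costs, assign)]
    norm = sum(scores)
    return [s / norm for s in scores]


def sensitivity_sample(points, k, m, seed=0):
    """Draw m indices with replacement, with importance weights 1 / (m * p_i)."""
    rng = random.Random(seed)
    centers, assign = kmeans(points, k, rng=rng)
    probs = sampling_probabilities(points, centers, assign)
    idx = rng.choices(range(len(points)), weights=probs, k=m)
    weights = [1.0 / (m * probs[i]) for i in idx]
    return idx, weights
```

Given per-example losses `loss[i]`, the quantity `sum(w * loss[i] for i, w in zip(idx, weights)) / len(points)` then estimates the dataset's average loss. The `1 / cluster size` term keeps every point's probability strictly positive, so the importance weights are always defined.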