CL · LG · Jun 10, 2024

MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models

arXiv:2406.06046v2 · 95 citations · Has Code
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of improving language model pretraining efficiency by selecting higher-quality data. The advance is incremental because it builds on existing data selection methods.

The paper tackles the limits of static pretraining data selection by introducing MATES, a model-aware method in which a data influence model continuously adapts to the pretraining model's evolving data preferences. In experiments pretraining 410M and 1B models on C4, MATES doubles the gains achieved by the state-of-the-art data selection approach and halves the total FLOPs required to reach certain performance levels.

Pretraining data selection has the potential to improve language model pretraining efficiency by utilizing higher-quality data from massive web data corpora. Current data selection methods, which rely on either hand-crafted rules or larger reference models, are conducted statically and do not capture the evolving data preferences during pretraining. In this paper, we introduce model-aware data selection with data influence models (MATES), where a data influence model continuously adapts to the evolving data preferences of the pretraining model and then selects the data most effective for the current pretraining progress. Specifically, we collect oracle data influence by locally probing the pretraining model and fine-tune a small data influence model to approximate it accurately. The data influence model then predicts data influence over the whole pretraining corpus and selects the most influential data for the next pretraining stage. Experiments of pretraining 410M and 1B models on the C4 dataset demonstrate that MATES significantly outperforms random data selection on extensive downstream tasks. It doubles the gains achieved by the state-of-the-art data selection approach that leverages larger reference models and reduces the total FLOPs required to reach certain performances by half. Further analyses validate the effectiveness of the locally probed oracle data influence and the approximation with data influence models. Our code is open-sourced at https://github.com/cxcscmu/MATES.
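The loop described in the abstract (probe oracle influence locally, fit a small influence model, score the whole corpus, select the top examples for the next stage) can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the authors' implementation: a linear regressor stands in for the pretraining model, a least-squares fit on quadratic features stands in for the data influence model, and every name here (`probe_oracle_influence`, `quad_features`, `model_aware_selection_loop`, the `probe_size`, `k`, and `lr` parameters) is hypothetical. Oracle influence is taken to be the drop in held-out reference loss after one SGD step on a single example.

```python
import numpy as np

def reference_loss(w, X_ref, y_ref):
    """Mean squared error of the toy linear model on a held-out reference set."""
    return float(np.mean((X_ref @ w - y_ref) ** 2))

def sgd_step(w, x_i, y_i, lr):
    """One SGD step on a single example under squared-error loss."""
    return w - lr * 2.0 * (x_i @ w - y_i) * x_i

def probe_oracle_influence(w, X, y, X_ref, y_ref, lr):
    """Assumed oracle: drop in reference loss after training on one example alone."""
    base = reference_loss(w, X_ref, y_ref)
    return np.array([
        base - reference_loss(sgd_step(w, x_i, y_i, lr), X_ref, y_ref)
        for x_i, y_i in zip(X, y)
    ])

def quad_features(X, y):
    """Hypothetical featurization: bias plus pairwise products of z = [x, y]."""
    Z = np.column_stack([X, y])
    iu, ju = np.triu_indices(Z.shape[1])
    return np.column_stack([np.ones(len(Z)), Z[:, iu] * Z[:, ju]])

def model_aware_selection_loop(X, y, X_ref, y_ref, stages=3,
                               probe_size=64, k=100, lr=0.01, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])          # toy stand-in for the pretraining model
    feats = quad_features(X, y)
    chosen = np.array([], dtype=int)
    for _ in range(stages):
        # 1) Probe the current model on a small random subset.
        probe = rng.choice(len(X), size=probe_size, replace=False)
        oracle = probe_oracle_influence(w, X[probe], y[probe], X_ref, y_ref, lr)
        # 2) Refit the small influence model to approximate the oracle scores.
        theta, *_ = np.linalg.lstsq(feats[probe], oracle, rcond=None)
        # 3) Score the whole corpus and keep the k highest predicted influences.
        chosen = np.argsort(-(feats @ theta))[:k]
        # 4) Train the model on the selected data for this stage.
        for i in chosen:
            w = sgd_step(w, X[i], y[i], lr)
    return w, chosen

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    w_true = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
    X = rng.normal(size=(500, 5))
    y = X @ w_true + 0.1 * rng.normal(size=500)
    X_ref = rng.normal(size=(100, 5))
    y_ref = X_ref @ w_true
    w, chosen = model_aware_selection_loop(X, y, X_ref, y_ref)
    print("reference loss after selection-driven training:",
          reference_loss(w, X_ref, y_ref))
```

The key design point the sketch tries to capture is that the influence model is refit at every stage against the current model state, so the ranking of the corpus changes as training progresses rather than being computed once up front.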

Code Implementations: 1 repo