LG AI CL Jun 15, 2024

CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-training

David Brandfonbrener, Hanlin Zhang, Andreas Kirsch, Jonathan Richard Schwarz, Sham Kakade

arXiv:2406.10670v319.318 citations Has Code

Originality: Incremental advance

AI Analysis

This work addresses the challenge of data selection for language model pre-training, offering a scalable method that reduces computational costs. It is rated incremental because it builds on existing heuristics.

The paper tackles the problem of selecting high-quality data for pre-training language models by proposing CoLoR-Filter, a method that ranks candidate data by the relative loss values of two auxiliary models and keeps a subset. Using a pair of 150m parameter auxiliary models, the selected data trains a 1.2b parameter target model to match a 1.2b parameter model trained on 25b randomly selected tokens, with 25x less data for Books and 11x less data for the downstream tasks.

Selecting high-quality data for pre-training is crucial in shaping the downstream task performance of language models. A major challenge lies in identifying this optimal subset, a problem generally considered intractable, thus necessitating scalable and effective heuristics. In this work, we propose a data selection method, CoLoR-Filter (Conditional Loss Reduction Filtering), which leverages an empirical Bayes-inspired approach to derive a simple and computationally efficient selection criterion based on the relative loss values of two auxiliary models. In addition to the modeling rationale, we evaluate CoLoR-Filter empirically on two language modeling tasks: (1) selecting data from C4 for domain adaptation to evaluation on Books and (2) selecting data from C4 for a suite of downstream multiple-choice question answering tasks. We demonstrate favorable scaling both as we subselect more aggressively and using small auxiliary models to select data for large target models. As one headline result, CoLoR-Filter data selected using a pair of 150m parameter auxiliary models can train a 1.2b parameter target model to match a 1.2b parameter model trained on 25b randomly selected tokens with 25x less data for Books and 11x less data for the downstream tasks.

Code: https://github.com/davidbrandfonbrener/color-filter-olmo

Filtered data: https://huggingface.co/datasets/davidbrandfonbrener/color-filtered-c4
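
The abstract says only that the selection criterion is based on the relative loss values of two auxiliary models. As a rough illustration of what such a criterion could look like, the Python sketch below ranks candidate sequences by the difference between two per-sequence losses and keeps the top fraction. The function name, the "prior" and "conditional" labels, the sign convention, and the top-fraction cutoff are all assumptions made for this example, and the loss values are made up. The linked repository is the reference for the authors' actual implementation.

```python
def select_by_loss_reduction(prior_losses, conditional_losses, keep_fraction):
    """Rank candidates by (prior loss - conditional loss) and keep the top fraction.

    Hypothetical helper for illustration only:
    - prior_losses[i]: loss of candidate i under one auxiliary model
    - conditional_losses[i]: loss of candidate i under the other auxiliary
      model (assumed here to be adapted to the target data)
    Returns the indices of the kept candidates, largest reduction first.
    """
    if len(prior_losses) != len(conditional_losses):
        raise ValueError("both loss lists must score the same candidates")
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")

    # Assumed sign convention: a larger value means the second model
    # finds the candidate easier than the first model does.
    reductions = [p - c for p, c in zip(prior_losses, conditional_losses)]

    # Keep at least one candidate whenever any are available.
    n_keep = max(1, round(keep_fraction * len(reductions)))

    # Sort candidate indices by loss reduction, largest first.
    ranked = sorted(range(len(reductions)), key=lambda i: reductions[i], reverse=True)
    return ranked[:n_keep]


# Toy example with made-up per-sequence losses for four candidates.
prior = [2.9, 3.1, 2.5, 3.4]
conditional = [2.8, 2.4, 2.6, 2.9]
selected = select_by_loss_reduction(prior, conditional, keep_fraction=0.5)
print(selected)  # [1, 3]
```

In this toy run, candidates 1 and 3 show the largest loss reductions (about 0.7 and 0.5), so they are the two kept when half of the four candidates are selected. Because scoring needs only forward passes through the two small auxiliary models, the ranking step itself is cheap relative to training the target model.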
