CL · Jun 27, 2025

AutoMixer: Checkpoint Artifacts as Automatic Data Mixers

arXiv:2506.21910v2 · 4 citations · h-index: 14 · ACL
Originality: Incremental advance
AI Analysis

This work addresses a bottleneck in language model training, the choice of data mixture, by offering researchers and practitioners an incremental method that reuses existing checkpoint artifacts to improve data quality.

The paper tackles the problem of determining optimal data mixtures for training language models on multiple tasks by using checkpoint artifacts as data mixers, achieving performance improvements of up to 1.93% on eight reasoning benchmarks.

In language model training, it is desirable to equip models with capabilities from various tasks. However, it is not clear how to directly obtain the right data mixtures for these capabilities, as the relationship between data and tasks is difficult to model. In this work, we observe that checkpoint models exhibit emerging capabilities at different points in the training trajectory. The training process often saves checkpoints as artifacts, which are under-utilized as a source of in-training data signals. We identify these artifact models based on their respective capabilities on the benchmarks and leverage them as data mixers by using their aggregated first-order influence approximation over source data. We demonstrate on eight reasoning benchmarks that the proposed framework yields significant improvements in the pretraining setting, with performance improvements of up to 1.93%. Overall, this shows the potential of checkpoint models to enhance data quality and optimize data mixtures.
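
To make "aggregated first-order influence approximation over source data" more concrete, here is a minimal sketch. It is not the paper's implementation. It assumes a common first-order approximation in which a training example's influence on a benchmark is the dot product of its loss gradient with the benchmark's loss gradient. It sums these scores over checkpoints, averages them per data source, and converts the averages into mixture weights with a softmax. The function names, the softmax step, the `temperature` parameter, and the toy gradients are all hypothetical choices made for illustration.

```python
import numpy as np


def aggregate_influence(checkpoints):
    """Sum first-order influence scores over checkpoints.

    checkpoints: list of (train_grads, bench_grad) pairs, one per checkpoint,
    where train_grads has shape [n_examples, dim] and bench_grad has shape [dim].
    The score of an example at one checkpoint is the dot product of its
    gradient with the benchmark gradient (an assumed approximation).
    """
    n_examples = checkpoints[0][0].shape[0]
    total = np.zeros(n_examples)
    for train_grads, bench_grad in checkpoints:
        total += train_grads @ bench_grad
    return total


def mixture_weights(per_example, source_ids, num_sources, temperature=1.0):
    """Turn per-example scores into per-source sampling weights.

    Assumes every source has at least one example. The softmax is an
    illustrative normalization, not something stated in the abstract.
    """
    per_source = np.array(
        [per_example[source_ids == s].mean() for s in range(num_sources)]
    )
    z = per_source / temperature
    z = z - z.max()  # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()


# Toy data: four training examples from two sources, two checkpoints.
source_ids = np.array([0, 0, 1, 1])
ckpt_a = (np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]),
          np.array([1.0, 0.0]))
ckpt_b = (np.array([[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]),
          np.array([1.0, 0.0]))

per_example = aggregate_influence([ckpt_a, ckpt_b])
weights = mixture_weights(per_example, source_ids, num_sources=2)
```

In this toy setup, source 0 has gradients aligned with the benchmark gradient at both checkpoints, so it receives the larger mixture weight. In practice the gradients would come from the saved checkpoint models, and full gradients are usually too large to store, so some form of projection or approximation would be needed.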

