CL AI Feb 24, 2025

Topic Over Source: The Key to Effective Data Mixing for Language Models Pre-training

Jiahui Peng, Xinlin Zhuang, Jiantao Qiu, Ren Ma, Jing Yu, He Zhu, Conghui He

arXiv:2502.16802v39.64 citations h-index: 6 Has Code

Originality: Incremental advance

AI Analysis

This work addresses a crucial data optimization problem for LLM developers, though it represents an incremental improvement over existing data mixing methods.

The paper tackles the problem of optimizing pre-training data composition for large language models by proposing a topic-based data mixing strategy in place of traditional source-based approaches. It reports that topic-based mixing consistently outperforms source-based mixing across multiple mixing techniques and achieves significantly lower validation loss.

The performance of large language models (LLMs) is significantly affected by the quality and composition of their pre-training data, which is inherently diverse, spanning various languages, sources, and topics. Effectively integrating these heterogeneous data groups is crucial for optimizing LLM performance. Previous research has predominantly concentrated on source-based data mixing, often neglecting the nuanced topic-level characteristics of the data. To address this gap, we propose a topic-based data mixing strategy that utilizes detailed topic labels generated through a multi-stage process combining unsupervised clustering, LLM-based summarization, and supervised classifier training. With this strategy, we conduct the first comprehensive comparison of topic-based versus source-based partitioning across multiple mixing strategies. We demonstrate that language models pretrained on data mixed by topics consistently outperform those trained on data mixed by sources across multiple methods including RegMix, DoReMi, temperature-based sampling, and a manual mixing method based on downstream task performance. Our theoretical analysis reveals that topic-based data achieves significantly lower validation loss compared to source-based approaches, creating a better optimization landscape for model training. We will make our code, annotated datasets, and topic classification models publicly available to facilitate further research.
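
As an illustration of one of the mixing methods named in the abstract, the sketch below shows a common formulation of temperature-based sampling, in which each group's sampling weight is proportional to its size raised to the power 1/T. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify the exact variant used, and the function name `temperature_mixing_weights`, the topic names, and the token counts are all hypothetical.

```python
from typing import Dict


def temperature_mixing_weights(
    group_sizes: Dict[str, int], temperature: float
) -> Dict[str, float]:
    """Return sampling weights proportional to size ** (1 / temperature).

    Assumed formulation (not taken from the paper): temperature = 1 samples
    groups in proportion to their size, and larger temperatures flatten the
    distribution toward uniform.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {
        name: size ** (1.0 / temperature) for name, size in group_sizes.items()
    }
    total = sum(scaled.values())
    return {name: value / total for name, value in scaled.items()}


# Hypothetical token counts per topic group (illustrative values only).
topic_token_counts = {"science": 8_000_000, "sports": 1_000_000, "law": 1_000_000}

proportional = temperature_mixing_weights(topic_token_counts, temperature=1.0)
flattened = temperature_mixing_weights(topic_token_counts, temperature=2.0)
```

The same function applies unchanged whether the groups are topics or sources; only the partition that defines `group_sizes` differs, which is the comparison the paper is about.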
