CL · Feb 2, 2025

FIRE: Flexible Integration of Data Quality Ratings for Effective Pre-Training

Liangyu Xu, Xuemiao Zhang, Feiyu Duan, Sirui Wang, Rongxiang Weng, Jingang Wang, Xunliang Cai

arXiv:2502.00761v3 · 6.73 citations · h-index: 12 · EMNLP

Originality: Highly original

AI Analysis

This work addresses the challenge of inefficient data selection for LLM pretraining, offering a scalable framework intended to improve training efficiency and model performance across downstream tasks.

The paper tackles the problem of selecting high-quality data for pretraining large language models. It proposes FIRE, a framework that integrates multiple data quality raters; models trained on FIRE-selected data reach the target performance with less than 37.5% of the training data needed by a random-selection baseline.

Selecting high-quality data can improve the pretraining efficiency of large language models (LLMs). Existing methods generally rely on heuristic techniques or single quality signals, limiting their ability to evaluate data quality comprehensively. In this work, we propose FIRE, a flexible and scalable framework for integrating multiple data quality raters, which allows for a comprehensive assessment of data quality across various dimensions. FIRE aligns multiple quality signals into a unified space, and integrates diverse data quality raters to provide a comprehensive quality signal for each data point. Further, we introduce a progressive data selection scheme based on FIRE that iteratively refines the selection of high-quality data points. Extensive experiments show that FIRE outperforms other data selection methods and significantly boosts pretrained model performance across a wide range of downstream tasks, while requiring less than 37.5% of the training data needed by the Random baseline to reach the target performance.
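
The abstract does not spell out how the quality signals are aligned or combined, so the following is only an illustrative sketch of the general idea, not the paper's actual method. It assumes that "aligning into a unified space" can be approximated by rank normalization and that "integration" can be approximated by a weighted average. The function names (`rank_normalize`, `integrate_ratings`, `select_top_fraction`), the rater names, and the weights are all hypothetical. The progressive selection scheme is not modeled here.

```python
from typing import Dict, List


def rank_normalize(scores: List[float]) -> List[float]:
    """Map raw scores to (0, 1] by rank so raters on different scales become comparable.

    Tied scores receive the average of their ranks. This is an assumed
    stand-in for the paper's alignment step, not its actual procedure.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of scores tied with position i.
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank / n
        i = j + 1
    return ranks


def integrate_ratings(
    ratings: Dict[str, List[float]], weights: Dict[str, float]
) -> List[float]:
    """Combine rank-normalized rater scores with a weighted average (assumed rule)."""
    aligned = {name: rank_normalize(s) for name, s in ratings.items()}
    total = sum(weights[name] for name in ratings)
    n = len(next(iter(ratings.values())))
    return [
        sum(weights[name] * aligned[name][i] for name in ratings) / total
        for i in range(n)
    ]


def select_top_fraction(combined: List[float], fraction: float) -> List[int]:
    """Return the indices of the highest-scoring documents, in original order."""
    k = max(1, int(len(combined) * fraction))
    ranked = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical raters scoring four documents on very different scales.
ratings = {
    "rater_a": [0.1, 0.9, 0.5, 0.3],
    "rater_b": [10.0, 80.0, 95.0, 20.0],
}
weights = {"rater_a": 1.0, "rater_b": 1.0}

combined = integrate_ratings(ratings, weights)
selected = select_top_fraction(combined, fraction=0.5)
```

Rank normalization is used here because it ignores each rater's scale and distribution; a z-score or a learned mapping would be equally valid choices under the same assumptions.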
