CL · Feb 2, 2025

FIRE: Flexible Integration of Data Quality Ratings for Effective Pre-Training

arXiv:2502.00761v3 · 3 citations · h-index: 12 · EMNLP
Originality: Highly original
AI Analysis

This paper addresses inefficient data selection in LLM pretraining, offering a scalable approach that improves training efficiency and model performance across tasks.

The paper tackles the problem of selecting high-quality data for pretraining large language models by proposing FIRE, a framework that integrates multiple data quality raters. Models trained on FIRE-selected data reach the target performance with less than 37.5% of the training data that a random-selection baseline needs.

Selecting high-quality data can improve the pretraining efficiency of large language models (LLMs). Existing methods generally rely on heuristic techniques or single quality signals, limiting their ability to evaluate data quality comprehensively. In this work, we propose FIRE, a flexible and scalable framework for integrating multiple data quality raters, which allows for a comprehensive assessment of data quality across various dimensions. FIRE aligns multiple quality signals into a unified space, and integrates diverse data quality raters to provide a comprehensive quality signal for each data point. Further, we introduce a progressive data selection scheme based on FIRE that iteratively refines the selection of high-quality data points. Extensive experiments show that FIRE outperforms other data selection methods and significantly boosts pretrained model performance across a wide range of downstream tasks, while requiring less than 37.5% of the training data needed by the Random baseline to reach the target performance.
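
The abstract does not say how the quality signals are aligned or combined, so the following is only an illustrative sketch of the general idea, not the paper's method. It assumes rank normalization as the way to put raters on a common scale and a weighted average as the way to integrate them. The function names `rank_normalize`, `integrate_ratings`, and `select_top_fraction` are hypothetical, and the progressive selection scheme is not modeled.

```python
def rank_normalize(scores):
    """Map one rater's raw scores to (0, 1] by rank.

    Assumption: rank normalization stands in for the paper's alignment step.
    Ties are broken by input order.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    aligned = [0.0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        aligned[idx] = rank / len(scores)
    return aligned


def integrate_ratings(rater_scores, weights=None):
    """Combine several raters into one score per data point.

    Assumption: a weighted average of aligned scores; equal weights by default.
    """
    if weights is None:
        weights = [1.0] * len(rater_scores)
    aligned = [rank_normalize(s) for s in rater_scores]
    total = sum(weights)
    n_points = len(aligned[0])
    return [
        sum(w * a[i] for w, a in zip(weights, aligned)) / total
        for i in range(n_points)
    ]


def select_top_fraction(combined, fraction):
    """Return indices of the highest-scoring fraction of data points."""
    k = max(1, round(fraction * len(combined)))
    ranked = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return ranked[:k]


# Two hypothetical raters scoring four documents on different scales.
rater_a = [0.1, 0.9, 0.5, 0.3]
rater_b = [10, 80, 95, 20]
combined = integrate_ratings([rater_a, rater_b])
selected = select_top_fraction(combined, 0.5)
```

Rank normalization is used here because it makes raters with different score ranges directly comparable without assuming anything about their distributions.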

