CL · Sep 26, 2024

Data Proportion Detection for Optimized Data Management for Large Language Models

arXiv:2409.17527v1 · h-index: 13
Originality: Incremental advance
AI Analysis

The paper addresses a data management challenge for LLM researchers: choosing data proportions when integrating multi-domain pre-training data. It frames this as a new topic, though the advance appears incremental because it builds on existing LLM frameworks.

The paper tackles the problem of determining optimal data proportions for pre-training large language models (LLMs). It introduces data proportion detection, which automatically estimates a model's pre-training data proportions from its generated outputs, and reports preliminary experimental results.

Large language models (LLMs) have demonstrated exceptional performance across a wide range of tasks and domains, with data preparation playing a critical role in achieving these results. Pre-training data typically combines information from multiple domains. To maximize performance when integrating data from various domains, determining the optimal data proportion is essential. However, state-of-the-art (SOTA) LLMs rarely disclose details about their pre-training data, making it difficult for researchers to identify ideal data proportions. In this paper, we introduce a new topic, data proportion detection, which enables the automatic estimation of pre-training data proportions by analyzing the generated outputs of LLMs. We provide rigorous theoretical proofs, practical algorithms, and preliminary experimental results for data proportion detection. Based on these findings, we offer valuable insights into the challenges and future directions for effective data proportion detection and data management.
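
The general idea of estimating domain proportions from generated outputs can be illustrated with a minimal sketch. This is a naive counting baseline under stated assumptions, not the paper's algorithm: the function `estimate_proportions`, the keyword-based `toy_classifier`, the domain labels, and the sample strings are all hypothetical. It also assumes that the domain mix of a model's outputs mirrors the mix of its pre-training data, which need not hold in practice.

```python
from collections import Counter
from typing import Callable, Dict, Iterable


def estimate_proportions(
    samples: Iterable[str], classify: Callable[[str], str]
) -> Dict[str, float]:
    """Return the fraction of samples assigned to each domain label."""
    counts = Counter(classify(sample) for sample in samples)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no samples provided")
    return {domain: n / total for domain, n in counts.items()}


def toy_classifier(text: str) -> str:
    """Hypothetical stand-in for a real domain classifier (keyword rules only)."""
    lowered = text.lower()
    if "def " in lowered or "import " in lowered:
        return "code"
    if "theorem" in lowered or "proof" in lowered:
        return "math"
    return "web"


# Placeholder strings standing in for text sampled from an LLM.
samples = [
    "import os",
    "def add(a, b): return a + b",
    "Theorem 1. Every bounded sequence ...",
    "The weather today is sunny.",
]

proportions = estimate_proportions(samples, toy_classifier)
print(proportions)
```

In a realistic setting the samples would be drawn from the model under study and the classifier would be a trained domain classifier; classifier errors and sampling settings would both bias the raw counts.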

Code Implementations: 1 repo
