CL AI LG NE

Jun 24, 2023

Beyond Scale: The Diversity Coefficient as a Data Quality Metric for Variability in Natural Language Data

Brando Miranda, Alycia Lee, Sudharsan Sundar, Allison Casasola, Rylan Schaeffer, Elyas Obbad, Sanmi Koyejo

arXiv:2306.13840v45.519 citations

h-index: 21

Originality: Incremental advance

AI Analysis

This work gives the nebulous concept of data quality in LLM training a formal metric. The advance is incremental because it formalizes one aspect of quality, variability, rather than offering a comprehensive solution.

The authors address the problem of measuring data quality for LLM pre-training by proposing a diversity coefficient that quantifies variability in natural language data. In controlled interventional experiments with GPT-2 and LLaMAv2, covering 44 models from 51M to 7B parameters, pre-training on more diverse data led to improved downstream evaluation performance.

Current trends in pre-training Large Language Models (LLMs) primarily focus on the scaling of model and dataset size. While the quality of pre-training data is considered an important factor for training powerful LLMs, it remains a nebulous concept that has not been rigorously characterized. To this end, we propose a formalization of one key aspect of data quality -- measuring the variability of natural language data -- specifically via a measure we call the diversity coefficient. Our empirical analysis shows that the proposed diversity coefficient aligns with the intuitive properties of diversity and variability, e.g., it increases as the number of latent concepts increases. Then, we measure the diversity coefficient of publicly available pre-training datasets and demonstrate that their formal diversity is high compared to theoretical lower and upper bounds. Finally, we conduct a comprehensive set of controlled interventional experiments with GPT-2 and LLaMAv2 that demonstrate the diversity coefficient of pre-training data characterizes useful aspects of downstream model evaluation performance -- totaling 44 models of various sizes (51M to 7B parameters). We conclude that our formal notion of diversity is an important aspect of data quality that captures variability and causally leads to improved evaluation performance.
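The abstract does not spell out how the coefficient is computed. In the paper it is described as the expected cosine distance between Task2Vec embeddings of batches sampled from a dataset. The following is a minimal sketch of that averaging step only, assuming the batch embeddings have already been computed elsewhere. The function names `cosine_distance` and `diversity_coefficient` are hypothetical, and plain lists of floats stand in for real Task2Vec embeddings.

```python
import math
from itertools import combinations
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Return 1 - cos(u, v). Raises ValueError if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def diversity_coefficient(batch_embeddings: Sequence[Sequence[float]]) -> float:
    """Average pairwise cosine distance over all distinct pairs of embeddings.

    Assumption: each entry is a fixed-length embedding of one sampled batch.
    Real Task2Vec embeddings are not computed here.
    """
    pairs = list(combinations(batch_embeddings, 2))
    if not pairs:
        raise ValueError("need at least two batch embeddings")
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)
```

Under this sketch, a set of identical batch embeddings gives a coefficient of 0, and mutually orthogonal embeddings give 1, which matches the intuition that the value grows as batches become less alike.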
