CL · Dec 4, 2024

Evaluating Language Models as Synthetic Data Generators

CMU
arXiv:2412.03679v2 · 31 citations · h-index: 34 · ACL
Originality: Incremental advance
AI Analysis

This work addresses the need for standardized evaluation of synthetic data generation in language model post-training. It is an incremental advance: it builds on prior data generation methods and contributes a systematic framework for comparing them.

The paper tackles the lack of systematic comparison of language models as synthetic data generators by introducing AgoraBench, a benchmark that evaluates their data generation abilities under standardized settings and metrics. Its results show that data generation capability does not necessarily correlate with problem-solving ability and is better indicated by intrinsic data quality features.

Given the increasing use of synthetic data in language model (LM) post-training, an LM's ability to generate high-quality data has become nearly as crucial as its ability to solve problems directly. While prior works have focused on developing effective data generation methods, they lack systematic comparison of different LMs as data generators in a unified setting. To address this gap, we propose AgoraBench, a benchmark that provides standardized settings and metrics to evaluate LMs' data generation abilities. Through synthesizing 1.26 million training instances using 6 LMs and training 99 student models, we uncover key insights about LMs' data generation capabilities. First, we observe that LMs exhibit distinct strengths. For instance, GPT-4o excels at generating new problems, while Claude-3.5-Sonnet performs better at enhancing existing ones. Furthermore, our analysis reveals that an LM's data generation ability doesn't necessarily correlate with its problem-solving ability. Instead, multiple intrinsic features of data quality (including response quality, perplexity, and instruction difficulty) collectively serve as better indicators. Finally, we demonstrate that strategic choices in output format and cost-conscious model selection significantly impact data generation effectiveness.

Code Implementations: 2 repos