CL · AI · Oct 27, 2024

Rethinking Data Synthesis: A Teacher Model Training Recipe with Interpretation

arXiv:2410.20362v2
Originality: Highly original
AI Analysis

The paper addresses the need for diverse, high-quality instruction data in LLM training. It offers a novel approach to improving synthetic data generation, though its advance over existing methods is incremental.

The paper tackles the problem of synthetic data generation for LLMs by proposing a new training paradigm, NOMAD, which specifically optimizes models for data generation rather than general question-answering, resulting in gains of over 4% on TriviaQA and over 2% on GSM8K with limited training data.

Recent advances in large language model (LLM) training have highlighted the need for diverse, high-quality instruction data. Many recent works explore synthetic data generation using LLMs. However, they primarily focus on prompt engineering with standard supervised instruction-finetuned models, an approach with a fundamental limitation: these models are optimized for general question-answering and problem-solving rather than data generation. We propose a paradigm shift named NOMAD by investigating how to train models specifically for data generation, demonstrating that this task differs significantly from training a classical LM. We identify two key factors: no-prompt-masked training and proper training set size selection. Our method, NOMAD, shows substantial improvements over baselines, achieving gains of >4% on TriviaQA and >2% on GSM8K with limited training data. Finally, we offer new insights by interpreting synthetic data through the lenses of "relevance" and "novelty".
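To make "no-prompt-masked training" concrete, here is a minimal sketch of how the training labels might differ from standard instruction finetuning. It assumes the term means that the loss is also computed on prompt tokens, which the abstract does not spell out. The helper `build_labels`, the toy token IDs, and the use of `-100` as the ignore label (a common convention in PyTorch-style cross-entropy losses) are illustrative assumptions, not the authors' code.

```python
# Assumption: -100 marks positions excluded from the loss, as is conventional
# for PyTorch-style cross-entropy with ignore_index=-100.
IGNORE_INDEX = -100


def build_labels(prompt_ids, response_ids, mask_prompt):
    """Hypothetical helper: return (input_ids, labels) for one training example.

    mask_prompt=True  -> standard instruction finetuning: loss on the response only.
    mask_prompt=False -> assumed "no-prompt-masked" variant: loss on every token,
                         so the model also learns to produce the prompt itself.
    """
    input_ids = list(prompt_ids) + list(response_ids)
    if mask_prompt:
        labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    else:
        labels = list(input_ids)
    return input_ids, labels


# Toy token IDs, purely illustrative.
prompt_ids = [11, 12, 13]
response_ids = [21, 22]

_, masked_labels = build_labels(prompt_ids, response_ids, mask_prompt=True)
_, unmasked_labels = build_labels(prompt_ids, response_ids, mask_prompt=False)
```

Under this reading, a teacher model trained with unmasked labels receives a learning signal for generating questions as well as answers, which is what a data generator has to do.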
