CL · AI · LG · May 18, 2025

Synthetic Data RL: Task Definition Is All You Need

arXiv:2505.17063v1 · 11 citations · h-index: 11 · Has Code
Originality: Highly original
AI Analysis

Synthetic Data RL enables scalable and efficient RL-based adaptation of foundation models to specialized tasks, reducing the need for costly human annotations.

The paper tackles reinforcement learning's reliance on large-scale human-labeled data by introducing Synthetic Data RL, a framework that fine-tunes models using only synthetic data generated from task definitions. It achieves absolute improvements of up to 29.2% (on GSM8K) and outperforms supervised fine-tuning under the same data budget.

Reinforcement learning (RL) is a powerful way to adapt foundation models to specialized tasks, but its reliance on large-scale human-labeled data limits broad adoption. We introduce Synthetic Data RL, a simple and general framework that fine-tunes models with reinforcement learning using only synthetic data generated from a task definition. Our method first generates question-answer pairs from the task definition and retrieved documents, then adapts question difficulty based on model solvability, and finally selects questions for RL training using the model's average pass rate across samples. On Qwen-2.5-7B, our method achieves a 29.2% absolute improvement over the base model on GSM8K (+2.9 pp vs. instruction-tuned, +6.6 pp vs. Self-Instruct), 8.7% on MATH, 13.1% on GPQA (+7.0 pp vs. SynthLLM), 8.9% on MedQA, 17.7% on CQA (law), and 13.7% on CFA (finance). It surpasses supervised fine-tuning under the same data budget and nearly matches RL with full human data across datasets (e.g., +17.2 pp on GSM8K). Adding 100 human demonstrations improves GSM8K performance by only 0.4 pp, showing limited added value. By reducing human data annotation, Synthetic Data RL enables scalable and efficient RL-based model adaptation. Code and demos are available at https://github.com/gydpku/Data_Synthesis_RL/.
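The pass-rate selection step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `low`/`high` thresholds, the hardest-first ordering, and the `budget` parameter are all assumptions made for illustration. The abstract only states that questions are selected using the model's average pass rate across samples.

```python
from statistics import mean


def pass_rate(outcomes):
    """Fraction of sampled answers judged correct for one question."""
    return mean(1.0 if ok else 0.0 for ok in outcomes)


def select_questions(samples, low=0.0, high=1.0, budget=None):
    """Pick questions by average pass rate (hypothetical criterion).

    samples: dict mapping a question id to a list of booleans, one per
             sampled model answer (True means the answer was correct).
    Keeps questions whose pass rate lies strictly between `low` and
    `high`, so always-solved and never-solved questions are dropped
    under the default thresholds. This rule is an assumption, not the
    paper's stated procedure.
    """
    # Skip questions that have no sampled answers at all.
    rates = {q: pass_rate(o) for q, o in samples.items() if o}
    kept = [q for q, r in rates.items() if low < r < high]
    # Assumed ordering: lowest pass rate first, ties broken by id.
    kept.sort(key=lambda q: (rates[q], q))
    return kept if budget is None else kept[:budget]


samples = {
    "q_easy": [True, True, True, True],
    "q_mid": [True, False, True, False],
    "q_hard": [False, False, False, True],
    "q_impossible": [False, False, False, False],
}
selected = select_questions(samples)
```

Under these assumed thresholds, `q_easy` and `q_impossible` are discarded because every sample agrees, which would leave no learning signal in a reward based on correctness.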

Code Implementations: 1 repo
