LG | AI | Feb 16

Goldilocks RL: Tuning Task Difficulty to Escape Sparse Rewards for Reasoning

arXiv:2602.14868v1 | h-index: 6
Originality: Incremental advance
AI Analysis

The paper addresses inefficient training for reasoning tasks in AI, though it is an incremental improvement over existing curriculum learning methods.

The paper tackles the sample inefficiency that sparse rewards cause when large language models are trained for reasoning with reinforcement learning. It proposes Goldilocks, a teacher-driven data sampling strategy that selects questions of appropriate difficulty for the student model. On the OpenMathReasoning dataset, Goldilocks improves model performance compared to standard GRPO under the same compute budget.

Reinforcement learning has emerged as a powerful paradigm for unlocking reasoning capabilities in large language models. However, relying on sparse rewards makes this process highly sample-inefficient, as models must navigate vast search spaces with minimal feedback. While classic curriculum learning aims to mitigate this by ordering data based on complexity, the right ordering for a specific model is often unclear. To address this, we propose Goldilocks, a novel teacher-driven data sampling strategy that aims to predict each question's difficulty for the student model. The teacher model selects questions of appropriate difficulty for the student model, i.e., questions that are neither too easy nor too hard (Goldilocks principle), while training the student with GRPO. By leveraging the student's performance on seen samples, the teacher continuously adapts to the student's evolving abilities. On the OpenMathReasoning dataset, Goldilocks data sampling improves the performance of models trained with standard GRPO under the same compute budget.
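The selection idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify how the teacher predicts difficulty, so this sketch assumes a simple per-question table of estimated success probabilities, a binary 0/1 reward, a p(1 - p) score, top-k selection, and a moving-average update. All function names and parameters here are hypothetical. The one thing the sketch relies on that is standard is the GRPO-style group normalization, under which a group of rollouts with identical rewards yields zero advantage and therefore no learning signal.

```python
def group_advantages(rewards, eps=1e-8):
    """GRPO-style group-normalized advantages: (r - mean) / (std + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


def goldilocks_score(p_success):
    """Assumed score: variance of a 0/1 reward, largest at p = 0.5."""
    return p_success * (1.0 - p_success)


def select_batch(predicted_p, k):
    """Pick the k question ids whose predicted success rate is closest
    to 'neither too easy nor too hard' under the assumed score."""
    ranked = sorted(predicted_p, key=lambda q: goldilocks_score(predicted_p[q]),
                    reverse=True)
    return ranked[:k]


def update_estimate(old_p, rewards, lr=0.5):
    """Hypothetical teacher update: move the estimate toward the student's
    observed pass rate on the rollouts just sampled."""
    observed = sum(rewards) / len(rewards)
    return (1.0 - lr) * old_p + lr * observed


if __name__ == "__main__":
    # Illustrative, made-up difficulty estimates for four questions.
    predicted_p = {"easy": 0.98, "mid": 0.5, "hard": 0.02, "okay": 0.7}
    print(select_batch(predicted_p, 2))
    # All-correct or all-wrong groups carry no gradient signal.
    print(group_advantages([1, 1, 1, 1]))
    print(group_advantages([1, 0, 1, 0]))
```

Under these assumptions, questions the student almost always solves or almost never solves are skipped because their rollout groups would mostly produce zero advantages, while questions with mixed outcomes are preferred.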
