LG · AI · CL · Mar 16

Curriculum Reinforcement Learning from Easy to Hard Tasks Improves LLM Reasoning

arXiv:2506.06632 · 67.966 citations · h-index: 25 · Has Code
Predicted impact: top 1% in LG · last 90 days · Originality: Incremental advance
AI Analysis

This paper addresses the difficulty of improving reasoning in small LLMs on mathematical and coding tasks with RL, and represents an incremental advance in RL-based training methods.

The paper proposes a curriculum reinforcement learning method that schedules tasks from easy to hard in order to improve reasoning in small language models (1.5B to 3B parameters). The authors report that it significantly improves reasoning ability over vanilla RL alone.

We aim to improve the reasoning capabilities of language models via reinforcement learning (RL). Recent RL post-trained models like DeepSeek-R1 have demonstrated reasoning abilities on mathematical and coding tasks. However, prior studies suggest that using RL alone to improve reasoning on inherently difficult tasks is less effective. Here, we draw inspiration from curriculum learning and propose to schedule tasks from easy to hard (E2H), allowing LLMs to build reasoning skills gradually. Our method is termed E2H Reasoner. Empirically, we observe that, although easy tasks are important initially, fading them out through appropriate scheduling is essential in preventing overfitting. Theoretically, we establish convergence guarantees for E2H Reasoner within an approximate policy iteration framework. We derive finite-sample complexity bounds and show that when tasks are appropriately decomposed and conditioned, learning through curriculum stages requires fewer total samples than direct learning. Experiments across multiple domains show that E2H Reasoner significantly improves the reasoning ability of small LLMs (1.5B to 3B), which otherwise struggle when trained with vanilla RL alone, highlighting the effectiveness of our method. Our code can be found on https://github.com/divelab/E2H-Reasoning.
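The easy-to-hard scheduling idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see their repository for that). It assumes tasks are already grouped into discrete difficulty levels, and it uses a simple triangular weighting that moves the sampling mass from the easiest level to the hardest as training progresses, so that easy tasks fade out over time. The function names `curriculum_weights` and `sample_batch`, and the triangular kernel itself, are hypothetical choices made for illustration.

```python
import random


def curriculum_weights(step, total_steps, num_levels):
    """Return sampling probabilities over difficulty levels (0 = easiest).

    Illustrative assumption: the schedule's center moves linearly from the
    easiest level to the hardest, and each level's weight falls off linearly
    with its distance from that center (a triangular kernel of width 1).
    """
    if total_steps <= 1:
        progress = 1.0
    else:
        progress = min(max(step / (total_steps - 1), 0.0), 1.0)
    center = progress * (num_levels - 1)
    raw = [max(0.0, 1.0 - abs(level - center)) for level in range(num_levels)]
    total = sum(raw)
    return [w / total for w in raw]


def sample_batch(tasks_by_level, step, total_steps, batch_size, rng):
    """Draw a training batch whose difficulty mix follows the schedule."""
    weights = curriculum_weights(step, total_steps, len(tasks_by_level))
    levels = rng.choices(range(len(tasks_by_level)), weights=weights, k=batch_size)
    return [rng.choice(tasks_by_level[level]) for level in levels]


if __name__ == "__main__":
    # Hypothetical task pools, grouped from easy to hard.
    tasks = [["easy-1", "easy-2"], ["medium-1"], ["hard-1", "hard-2"]]
    rng = random.Random(0)
    for step in range(5):
        print(step, curriculum_weights(step, 5, len(tasks)))
    print(sample_batch(tasks, step=0, total_steps=5, batch_size=4, rng=rng))
```

In an RL post-training loop, a batch drawn this way would replace uniform sampling over the task pool. The shape of the schedule (how quickly easy tasks fade) is the main design choice, and the abstract indicates that this choice matters for avoiding overfitting to easy tasks.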

