AI · CL · May 19, 2025

Warm Up Before You Train: Unlocking General Reasoning in Resource-Constrained Settings

Safal Shrestha, Minwu Kim, Aadim Nepal, Anubhav Shrestha, Keith Ross

arXiv:2505.13718v27.82 citations · h-index: 3 · Has Code · EMNLP

Originality: Incremental advance

AI Analysis

The paper addresses data scarcity for researchers and practitioners who want to develop reasoning LLMs. The advance is incremental because it builds on existing RLVR and distillation methods.

The paper tackles the challenge of training reasoning-capable LLMs with limited data by proposing a two-stage strategy: first warm up the model by distilling long chains of thought from logic puzzles so that it acquires general reasoning skills, then apply RLVR on a small target dataset. Warmup alone improves performance on tasks such as MATH, HumanEval+, and MMLU-Pro. When both models are RLVR-trained on the same ≤100 examples, the warmed-up model outperforms the base model and uses training samples more efficiently.

Designing effective reasoning-capable LLMs typically requires training using Reinforcement Learning with Verifiable Rewards (RLVR) or distillation with carefully curated Long Chain of Thoughts (CoT), both of which depend heavily on extensive training data. This creates a major challenge when the amount of quality training data is scarce. We propose a sample-efficient, two-stage training strategy to develop reasoning LLMs under limited supervision. In the first stage, we "warm up" the model by distilling Long CoTs from a toy domain, namely, Knights & Knaves (K&K) logic puzzles to acquire general reasoning skills. In the second stage, we apply RLVR to the warmed-up model using a limited set of target-domain examples. Our experiments demonstrate that this two-phase approach offers several benefits: (i) the warmup phase alone facilitates generalized reasoning, leading to performance improvements across a range of tasks, including MATH, HumanEval+, and MMLU-Pro; (ii) when both the base model and the warmed-up model are RLVR trained on the same small dataset (≤100 examples), the warmed-up model consistently outperforms the base model; (iii) warming up before RLVR training allows a model to maintain cross-domain generalizability even after training on a specific domain; (iv) introducing warmup in the pipeline improves not only accuracy but also overall sample efficiency during RLVR training. The results in this paper highlight the promise of warmup for building robust reasoning LLMs in data-scarce environments.
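To make the toy domain and the notion of a verifiable reward concrete, here is a minimal Python sketch. It is not the authors' code: the puzzle encoding, the `solve` and `verifiable_reward` helpers, and the binary exact-match reward are all assumptions chosen for illustration, and the paper's actual puzzle format and reward design may differ. It relies only on the standard rule of Knights & Knaves puzzles: knights always tell the truth and knaves always lie.

```python
from itertools import product
from typing import Callable, Dict, List

# Hypothetical encoding: True means knight, False means knave.
Assignment = Dict[str, bool]
Statement = Callable[[Assignment], bool]


def solve(names: List[str], statements: Dict[str, Statement]) -> List[Assignment]:
    """Brute-force every role assignment and keep the consistent ones."""
    solutions = []
    for roles in product([True, False], repeat=len(names)):
        world = dict(zip(names, roles))
        # A knight's statement must be true; a knave's statement must be false.
        if all(statements[name](world) == world[name] for name in names):
            solutions.append(world)
    return solutions


def verifiable_reward(predicted: Assignment, reference: Assignment) -> float:
    """Assumed binary reward: 1.0 only if the predicted roles match exactly."""
    return 1.0 if predicted == reference else 0.0


# Example puzzle: A says "B is a knave"; B says "A and I are both knights".
names = ["A", "B"]
statements = {
    "A": lambda w: not w["B"],
    "B": lambda w: w["A"] and w["B"],
}

solutions = solve(names, statements)
reference = solutions[0]

print(solutions)
print(verifiable_reward({"A": True, "B": False}, reference))
print(verifiable_reward({"A": False, "B": True}, reference))
```

Because every puzzle can be checked mechanically like this, a model's final answer can be scored without human labels, which is the property that makes a reward "verifiable" in the RLVR sense.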
