Dual-Phase LLM Reasoning: Self-Evolved Mathematical Frameworks
This work aims to improve LLM reasoning on mathematical problem-solving with a resource-efficient method, though it appears incremental because it builds on existing supervised fine-tuning approaches.
The paper tackles the problem of enhancing large language models' self-correction capabilities in mathematical reasoning by proposing a two-stage training framework that uses self-generated chain-of-thought data, resulting in performance improvements on benchmarks such as GSM8K, MATH500, and AIME24.
In recent years, large language models (LLMs) have shown significant potential in complex reasoning tasks such as mathematical problem-solving. However, existing research relies predominantly on reinforcement learning (RL) frameworks and largely overlooks supervised fine-tuning (SFT) methods. This paper proposes a two-stage training framework that enhances a model's self-correction capabilities using self-generated long chain-of-thought (CoT) data. In the first stage, a multi-turn dialogue strategy guides the model to generate CoT data that incorporates verification, backtracking, subgoal decomposition, and backward reasoning; predefined rules then filter high-quality samples for supervised fine-tuning. The second stage employs a difficulty-aware rejection sampling mechanism to dynamically optimize the data distribution, strengthening the model's ability to handle complex problems. The approach yields reasoning chains more than four times longer while maintaining strong scalability, indicating that SFT can effectively activate a model's intrinsic reasoning capabilities and offering a resource-efficient pathway for complex task optimization. Experiments show performance improvements on mathematical benchmarks including GSM8K and MATH500, with the fine-tuned model achieving a substantial improvement on competition-level problems such as AIME24. Code will be open-sourced.
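To make the two stages more concrete, the following is a minimal sketch of how rule-based filtering and difficulty-aware rejection sampling could fit together. It is not the authors' implementation: the abstract does not specify the filtering rules, the difficulty measure, or the sampling quotas. The marker phrases, the `Attempt` type, the `passes_rules`, `difficulty`, and `select_for_sft` functions, and the `base_keep`/`max_keep` parameters are all hypothetical. The sketch assumes difficulty is estimated as the fraction of incorrect attempts per problem and that harder problems are allowed to contribute more accepted samples.

```python
from __future__ import annotations

import random
from dataclasses import dataclass

# ASSUMPTION: hypothetical marker phrases standing in for the paper's
# "predefined rules", which the abstract does not list.
BEHAVIOR_MARKERS = {
    "verification": ("let me verify", "let me check"),
    "backtracking": ("wait", "let me try a different"),
    "subgoal": ("first,", "step 1"),
    "backward": ("working backward",),
}


def passes_rules(cot: str, min_behaviors: int = 2) -> bool:
    """Keep a chain of thought only if it shows enough distinct behaviors."""
    text = cot.lower()
    hits = sum(
        any(marker in text for marker in markers)
        for markers in BEHAVIOR_MARKERS.values()
    )
    return hits >= min_behaviors


@dataclass
class Attempt:
    cot: str        # model-generated reasoning chain
    correct: bool   # whether the final answer matched the reference


def difficulty(attempts: list[Attempt]) -> float:
    """ASSUMPTION: difficulty = fraction of incorrect attempts (1.0 = never solved)."""
    if not attempts:
        return 1.0
    return 1.0 - sum(a.correct for a in attempts) / len(attempts)


def select_for_sft(
    attempts_by_problem: dict[str, list[Attempt]],
    base_keep: int = 1,
    max_keep: int = 4,
    rng: random.Random | None = None,
) -> list[tuple[str, str]]:
    """Rejection-sample correct, rule-passing chains, keeping more for harder problems."""
    rng = rng or random.Random(0)
    selected: list[tuple[str, str]] = []
    for problem, attempts in attempts_by_problem.items():
        # Reject wrong answers and chains that fail the rule filter.
        accepted = [a for a in attempts if a.correct and passes_rules(a.cot)]
        if not accepted:
            continue
        # Harder problems get a larger quota, shifting the data distribution.
        quota = base_keep + round(difficulty(attempts) * (max_keep - base_keep))
        for attempt in rng.sample(accepted, min(quota, len(accepted))):
            selected.append((problem, attempt.cot))
    return selected
```

Under these assumptions, a problem the model always solves contributes at most `base_keep` samples, while a problem it rarely solves can contribute up to `max_keep`, which is one simple way to tilt the fine-tuning set toward complex problems.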