Towards Revealing the Effectiveness of Small-Scale Fine-tuning in R1-style Reinforcement Learning
This work addresses the inefficiency of small-scale fine-tuning in R1-style RL. It offers AI researchers and practitioners a method that reduces computational cost while maintaining performance, though the contribution is an incremental improvement on existing RL techniques.
The paper tackles the inefficiency of small-scale supervised fine-tuning (SFT) in R1-style reinforcement learning (RL) for large language models. It proposes Re-distillation, a technique that samples from the RL-trained policy to boost SFT efficiency. The results show that re-distilled models match RL performance with far fewer samples and less computation: a re-distilled Qwen-2.5-1.5B model surpasses DeepSeek-V3-0324 on the K&K dataset using only 1K SFT samples.
R1-style Reinforcement Learning (RL) significantly enhances the reasoning capabilities of Large Language Models, yet the mechanism behind rule-based RL remains unclear. We find that small-scale SFT has a substantial influence on RL but is inefficient. To explain these observations, we propose an analytical framework and compare the efficiency of SFT and RL by measuring the \textbf{sample effect}. Our hypothetical analysis shows that SFT efficiency can potentially be improved. Guided by this analysis, we propose \textbf{Re-distillation}, a technique that aims to boost the effectiveness of small-scale distillation by sampling from the RL-trained policy. Re-distillation is consistently and surprisingly efficient across three datasets and on both Qwen and Llama models: re-distilled models match RL performance with far fewer samples and less computation. As a result, on the K\&K dataset, our re-distilled Qwen-2.5-1.5B model surpasses DeepSeek-V3-0324 with only 1K SFT samples. We also demonstrate that re-distillation can be used to efficiently balance multiple goals in RL. Our work explains several interesting phenomena in R1-style RL, shedding light on the mechanisms behind its empirical success. Code is available at: https://github.com/on1262/deep-reasoning.
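
The abstract describes Re-distillation only at a high level: sample from the RL-trained policy and use those samples for small-scale SFT. A minimal sketch of the data-collection step, under stated assumptions, might look like the following. The names `build_redistillation_set`, `sample_fn`, and `is_correct` are hypothetical and do not come from the paper or its repository. Keeping only responses that pass a rule-based check is also an assumption of this sketch, not a detail confirmed by the abstract.

```python
from typing import Callable, Dict, List


def build_redistillation_set(
    prompts: List[str],
    sample_fn: Callable[[str], str],
    is_correct: Callable[[str, str], bool],
    max_samples: int = 1000,
) -> List[Dict[str, str]]:
    """Collect a small SFT dataset by sampling from an RL-trained policy.

    sample_fn  : hypothetical wrapper that returns one response from the
                 RL-trained policy for a given prompt.
    is_correct : hypothetical rule-based checker (assumed here; the
                 abstract does not say how samples are filtered).
    max_samples: cap on dataset size (the abstract reports 1K SFT samples).
    """
    dataset: List[Dict[str, str]] = []
    for prompt in prompts:
        # Stop as soon as the small-scale budget is reached.
        if len(dataset) >= max_samples:
            break
        response = sample_fn(prompt)
        # Keep only responses the rule-based checker accepts.
        if is_correct(prompt, response):
            dataset.append({"prompt": prompt, "response": response})
    return dataset


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without a real model.
    toy_policy = lambda p: p.upper()
    toy_checker = lambda p, r: len(p) % 2 == 0
    demo = build_redistillation_set(
        ["ab", "abc", "abcd", "abcdef"], toy_policy, toy_checker, max_samples=2
    )
    print(demo)
```

The resulting list of prompt/response pairs would then be passed to an ordinary SFT run on the base model. That training step is omitted here because the abstract does not specify its configuration.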