Boosting Accuracy and Efficiency of Budget Forcing in LLMs via Reinforcement Learning for Mathematical Reasoning
This work addresses efficiency and accuracy issues in budget forcing for mathematical reasoning in smaller LLMs, representing an incremental improvement over existing methods.
The paper tackles performance degradation in smaller language models caused by the verbose responses that supervised fine-tuning (SFT) for budget forcing produces, and integrates reinforcement learning to improve token efficiency and accuracy. The resulting 1.5B model achieves higher accuracy on the GSM8K dataset while reducing token usage by over 40% compared to the SFT model.
Test-time scaling methods have rapidly gained popularity for their computational efficiency and parameter-independent training in improving the reasoning performance of large language models (LLMs). One such method is budget forcing, a decoding intervention strategy that allocates extra compute budget for thinking and elicits the model's inherent self-correcting behavior. However, budget forcing relies on supervised fine-tuning (SFT) on long-context reasoning traces, which degrades the performance of smaller models because their responses become verbose. For this reason, we offer a framework that integrates reinforcement learning (RL) to improve token efficiency and boost the performance of a 1.5B model on mathematical reasoning. We demonstrate this using only 1.5K training samples and find that our SFT+RL model performs better than the SFT model on the GSM8K dataset across varying compute budgets. Our main findings show higher overall accuracy together with a reduction in token usage of over 40% compared to the SFT model, indicating that RL can recover the losses caused by long-context training and improve overall performance in mathematical reasoning.
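To make the idea of budget forcing as a decoding intervention concrete, the following is a minimal sketch under stated assumptions, not the implementation used in this work. It assumes the thinking phase ends with a delimiter token (here the hypothetical `</think>`), that an early stop is suppressed by appending a continuation cue (here the hypothetical `Wait`), and that the thinking phase is closed forcibly once a maximum budget is reached. The names `budget_forced_decode`, `toy_model`, `min_think`, and `max_think` are illustrative only, and `toy_model` is a stand-in for a real LLM.

```python
from typing import Callable, List

END_THINK = "</think>"  # assumed end-of-thinking delimiter (hypothetical)
WAIT = "Wait"           # assumed continuation cue (hypothetical)


def budget_forced_decode(
    next_token: Callable[[List[str]], str],
    prompt: List[str],
    min_think: int,
    max_think: int,
) -> List[str]:
    """Generate thinking tokens under a [min_think, max_think] token budget."""
    thinking: List[str] = []
    while len(thinking) < max_think:
        tok = next_token(prompt + thinking)
        if tok == END_THINK:
            if len(thinking) >= min_think:
                break  # minimum budget already spent: allow the stop
            # The model tried to stop early: drop the delimiter and
            # append a cue that nudges it to keep reasoning.
            thinking.append(WAIT)
            continue
        thinking.append(tok)
    # Close the thinking phase (forced if the maximum budget ran out).
    thinking.append(END_THINK)
    return thinking


def toy_model(context: List[str]) -> str:
    """Stand-in for an LLM: tries to stop whenever the context length
    is a multiple of three, otherwise emits a generic reasoning token."""
    return END_THINK if len(context) % 3 == 0 else "step"


if __name__ == "__main__":
    # The early stop is suppressed once, then accepted.
    print(budget_forced_decode(toy_model, ["Q"], min_think=5, max_think=8))
    # A tight maximum budget cuts thinking short.
    print(budget_forced_decode(toy_model, ["Q"], min_think=0, max_think=1))
```

In this sketch, raising `min_think` lengthens the reasoning trace and lowering `max_think` shortens it, which is the sense in which the compute budget is varied at test time. In a real system the same logic would operate on token IDs inside the decoding loop of the model.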