BFS-PO: Best-First Search for Large Reasoning Models
This addresses efficiency and output conciseness issues for users of large reasoning models, but it is incremental as it builds on existing RL methods.
The paper tackles the problem of overthinking in Large Reasoning Models, which leads to high computational costs and verbose outputs, by proposing BFS-PO, an RL algorithm that uses Best-First Search to find the shortest correct answers, resulting in increased accuracy and shorter responses in benchmarks.
Large Reasoning Models (LRMs) such as OpenAI o1 and DeepSeek-R1 have shown excellent performance in reasoning tasks using long reasoning chains. However, this has also led to a significant increase of computational costs and the generation of verbose output, a phenomenon known as overthinking. The tendency to overthinking is often exacerbated by Reinforcement Learning (RL) algorithms such as GRPO/DAPO. In this paper, we propose BFS-PO, an RL algorithm which alleviates this problem using a Best-First Search exploration strategy. Specifically, BFS-PO looks for the shortest correct answer using a backtracking mechanism based on maximum entropy nodes. By generating progressively shorter responses during training, BFS-PO learns to produce concise reasoning chains. Using different benchmarks and base LRMs, we show that BFS-PO can simultaneously increase the LRM accuracy and shorten its answers.