ThinkBrake: Mitigating Overthinking in Tool Reasoning
ThinkBrake addresses a specific bottleneck in tool reasoning for small models, namely overthinking, and offers an incremental improvement over existing early-termination methods.
The paper tackles overthinking in small reasoning models during tool use, where a model reaches a correct tool-argument configuration and then overwrites it with an incorrect call. An oracle that terminates reasoning at the right sentence boundary lifts average accuracy on the Berkeley Function Calling Leaderboard from 85.8% to 94.2%, which shows how much headroom is recoverable. The paper then introduces ThinkBrake, a training-free decoding heuristic that preserves or improves accuracy while reducing tokens by up to 25%.
Small reasoning models (SRMs) often overthink during tool use: they reach a correct tool-argument configuration, then continue reasoning and overwrite it with an incorrect final call. We diagnose overthinking via oracle rollouts that inject </think> at sentence boundaries. On the Berkeley Function Calling Leaderboard (BFCL), this oracle termination lifts average accuracy from 85.8% to 94.2% while reducing tokens by 80-94%, revealing substantial recoverable headroom and potentially redundant reasoning. While prior work on concise reasoning has largely targeted mathematics, tool reasoning remains underexplored. We adapt various early-termination baselines to tool use and introduce ThinkBrake, a training-free decoding heuristic. ThinkBrake monitors the log-probability margin between </think> and the current top token at sentence boundaries and triggers termination when this margin becomes small. Across BFCL's single-turn, non-live, and live splits, ThinkBrake preserves or improves accuracy while reducing tokens by up to 25%, outperforming various baselines.
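The termination rule described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the punctuation-based sentence-boundary check, and the threshold value of 0.5 are all assumptions made for illustration, since the abstract states only that termination triggers when the log-probability margin between </think> and the top token "becomes small".

```python
import math
from typing import Mapping

END_THINK = "</think>"  # end-of-reasoning token named in the abstract
# Assumed boundary heuristic; the abstract does not say how boundaries are detected.
SENTENCE_ENDINGS = (".", "!", "?", "\n")


def log_softmax(logits: Mapping[str, float]) -> dict[str, float]:
    """Convert raw next-token logits into log-probabilities (numerically stable)."""
    peak = max(logits.values())
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))
    return {token: value - log_z for token, value in logits.items()}


def should_brake(
    logits: Mapping[str, float],
    last_token: str,
    threshold: float = 0.5,  # hypothetical value, not taken from the paper
) -> bool:
    """Return True when reasoning should stop and </think> should be emitted.

    The check only runs at a sentence boundary. The margin is the gap between
    the log-probability of the current top token and that of </think>.
    """
    if not last_token.endswith(SENTENCE_ENDINGS):
        return False
    log_probs = log_softmax(logits)
    top_log_prob = max(log_probs.values())
    end_log_prob = log_probs.get(END_THINK, -math.inf)
    margin = top_log_prob - end_log_prob  # always >= 0
    return margin <= threshold


# Toy next-token logits at a sentence boundary (made-up numbers).
close_call = {"</think>": 2.0, "Wait": 2.2, "the": 0.0}
far_apart = {"</think>": -3.0, "Wait": 2.2, "the": 0.0}
brake_now = should_brake(close_call, last_token=".")
keep_going = should_brake(far_apart, last_token=".")
mid_sentence = should_brake(close_call, last_token="call")
```

In a real decoding loop, the logits would come from the model at each step, and a True result would force </think> as the next token so the model proceeds to its final tool call. Because the margin is a difference of log-probabilities, it can be computed from raw logits directly; the normalization is kept here only for readability.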