CL · Apr 26, 2025

When2Call: When (not) to Call Tools

arXiv:2504.18851v1 · 28 citations · h-index: 8 · Has Code · NAACL
Originality: Incremental advance
AI Analysis

This work addresses the need for better evaluation of tool-calling decisions in language models. It is an incremental advance: it builds on existing tool-calling benchmarks but focuses on decision-making (whether a tool should be called at all) rather than only on the accuracy of the call.

The authors tackled the problem of evaluating when language models should or should not call external tools, developing a new benchmark called When2Call that assesses tool-calling decision-making. They found that state-of-the-art tool-calling LMs show significant room for improvement on this benchmark, and their preference optimization training regime achieved considerably more improvement than traditional fine-tuning.

Leveraging external tools is a key feature for modern Language Models (LMs) to expand their capabilities and integrate them into existing systems. However, existing benchmarks primarily focus on the accuracy of tool calling -- whether the correct tool is called with the correct parameters -- and less on evaluating when LMs should (not) call tools. We develop a new benchmark, When2Call, which evaluates tool-calling decision-making: when to generate a tool call, when to ask follow-up questions and when to admit the question can't be answered with the tools provided. We find that state-of-the-art tool-calling LMs show significant room for improvement on When2Call, indicating the importance of this benchmark. We also develop a training set for When2Call and leverage the multiple-choice nature of the benchmark to develop a preference optimization training regime, which shows considerably more improvement than traditional fine-tuning. We release the benchmark and training data as well as evaluation scripts at https://github.com/NVIDIA/When2Call.
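The abstract's idea of using the benchmark's multiple-choice nature for preference optimization can be illustrated with a minimal sketch. It assumes each item carries one candidate response per behaviour (tool call, follow-up question, unable to answer) and a label marking the correct one. The `Item` class, its field names, the behaviour labels, the `get_weather` tool, and the `prompt`/`chosen`/`rejected` keys are all hypothetical illustrations, not the format used in the released When2Call data.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical multiple-choice item; not the released When2Call schema."""
    prompt: str
    options: dict[str, str]  # behaviour label -> candidate response
    correct: str             # label of the appropriate behaviour


def to_preference_pairs(item: Item) -> list[dict[str, str]]:
    """Pair the correct option (chosen) with every other option (rejected)."""
    chosen = item.options[item.correct]
    return [
        {"prompt": item.prompt, "chosen": chosen, "rejected": response}
        for label, response in item.options.items()
        if label != item.correct
    ]


# Illustrative item: the user never names a city, so a follow-up question
# is the appropriate behaviour and the tool call guesses an argument.
example = Item(
    prompt="What's the weather like tomorrow?",
    options={
        "tool_call": '{"name": "get_weather", "arguments": {"city": "Paris"}}',
        "follow_up": "Which city would you like the forecast for?",
        "unable": "I can't answer that with the tools available.",
    },
    correct="follow_up",
)

pairs = to_preference_pairs(example)
for pair in pairs:
    print(pair["chosen"], "|", pair["rejected"])
```

Each item yields one pair per incorrect option, so the same prompt contrasts the appropriate behaviour against each inappropriate one. Whether the authors construct pairs this way is not stated in the abstract.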

Code Implementations: 1 repo