AI · May 10

TIDE-Bench: Task-Aware and Diagnostic Evaluation of Tool-Integrated Reasoning

arXiv:2605.09544 · 39.2
Predicted impact: top 20% in AI · last 90 days
Originality: Incremental advance
AI Analysis

For researchers developing tool-integrated reasoning methods, this benchmark provides a more comprehensive and efficient evaluation than existing ones.

TIDE-Bench introduces a unified benchmark for evaluating tool-integrated reasoning in LLMs, featuring diverse tasks, task-aware metrics, and high-quality filtered datasets. Experiments reveal persistent bottlenecks in tool grounding.

Tool-integrated reasoning has emerged as a promising paradigm for enhancing large language models with external computation, retrieval, and execution capabilities. However, the field still lacks a high-quality and unified evaluation benchmark, and existing TIR evaluations remain limited in dataset quality, task diversity, diagnostic comprehensiveness, and evaluation efficiency. In this work, we introduce TIDE-Bench, a holistic and efficient benchmark for evaluating TIR methods, featuring three key advantages. First, it provides diverse task settings, combining widely used mathematical reasoning and knowledge-intensive QA tasks with two newly designed tasks, namely the tool-grounded experimental design task and the dynamic interactive task, to probe models' abilities in complex tool invocation and multi-tool coordination. Second, TIDE-Bench adopts a comprehensive yet task-aware evaluation protocol, jointly measuring final answer quality, process reliability, tool-use efficiency, and inference cost across heterogeneous task settings. Third, TIDE-Bench constructs high-quality and discriminative evaluation sets by filtering low-discrimination instances from existing datasets, substantially reducing evaluation cost while focusing on more challenging samples. Extensive experiments on multiple foundation models and TIR methods reveal persistent bottlenecks in tool grounding, offering insights for future TIR research.
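The abstract does not say how low-discrimination instances are identified. As a rough illustration only, the sketch below assumes one common notion of discrimination: an instance is uninformative if nearly every reference model solves it or nearly every reference model fails it. The function name `filter_low_discrimination`, the `results` layout, and the `min_rate`/`max_rate` thresholds are all hypothetical and are not taken from TIDE-Bench.

```python
from typing import Dict, List


def filter_low_discrimination(
    results: Dict[str, List[bool]],
    min_rate: float = 0.1,
    max_rate: float = 0.9,
) -> List[int]:
    """Return indices of instances worth keeping.

    `results` maps a model name to a per-instance list of correctness flags
    (hypothetical layout). An instance is kept when its pass rate across
    models lies within [min_rate, max_rate]; this criterion is an assumption,
    not the benchmark's documented procedure.
    """
    models = list(results)
    if not models:
        return []
    n_instances = len(results[models[0]])
    kept = []
    for i in range(n_instances):
        # Fraction of models that answered instance i correctly.
        pass_rate = sum(results[m][i] for m in models) / len(models)
        if min_rate <= pass_rate <= max_rate:
            kept.append(i)
    return kept


# Toy example with three hypothetical models and four instances.
toy_results = {
    "model_a": [True, True, False, False],
    "model_b": [True, False, False, True],
    "model_c": [True, True, False, False],
}
kept_indices = filter_low_discrimination(toy_results)
```

In this toy example, instance 0 (solved by every model) and instance 2 (solved by none) are dropped, which is the sense in which such filtering could reduce evaluation cost while keeping the samples that separate models.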

