CL · Mar 12, 2024

StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models

Tsinghua
arXiv:2403.07714v5 · 117 citations · h-index: 35 · ACL
Originality: Synthesis-oriented
AI Analysis

This work addresses benchmark instability in the evaluation of LLM tool learning. It is incremental, since it builds on existing benchmarks such as ToolBench.

The authors tackle instability in large-scale benchmarking of LLM tool learning by introducing StableToolBench, which combines a virtual API server with a stable evaluation system. Their experiments demonstrate the benchmark's stability and examine the effectiveness of its components.

Large Language Models (LLMs) have witnessed remarkable advancements in recent years, prompting the exploration of tool learning, which integrates LLMs with external tools to address diverse real-world challenges. Assessing the capability of LLMs to utilise tools necessitates large-scale and stable benchmarks. However, previous works relied on either hand-crafted online tools with limited scale, or large-scale real online APIs suffering from instability of API status. To address this problem, we introduce StableToolBench, a benchmark evolving from ToolBench, proposing a virtual API server and stable evaluation system. The virtual API server contains a caching system and API simulators which are complementary to alleviate the change in API status. Meanwhile, the stable evaluation system designs solvable pass and win rates using GPT-4 as the automatic evaluator to eliminate the randomness during evaluation. Experimental results demonstrate the stability of StableToolBench, and further discuss the effectiveness of API simulators, the caching system, and the evaluator system.
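
To make the virtual API server idea concrete, here is a minimal sketch of how a cache and a simulator could complement each other. It is not the authors' implementation: the class name `VirtualAPIServer`, the `real_api` and `simulator` callables, and the lookup order (cache first, then the real API, then the simulator) are all assumptions made for illustration. In the paper the simulator is LLM-based; here it is just any function that returns a plausible response.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical type: takes an API name and its arguments, returns a response.
ApiFn = Callable[[str, Dict[str, Any]], Dict[str, Any]]


class VirtualAPIServer:
    """Illustrative cache-plus-simulator wrapper (names are assumptions)."""

    def __init__(self, real_api: ApiFn, simulator: ApiFn) -> None:
        self.real_api = real_api
        self.simulator = simulator
        self.cache: Dict[str, Dict[str, Any]] = {}

    @staticmethod
    def _key(api_name: str, args: Dict[str, Any]) -> str:
        # Sort keys so the same arguments always map to the same cache entry.
        return json.dumps([api_name, args], sort_keys=True)

    def call(self, api_name: str, args: Dict[str, Any]) -> Dict[str, Any]:
        key = self._key(api_name, args)
        # 1. A cached response keeps repeated evaluations identical.
        if key in self.cache:
            return self.cache[key]
        # 2. Otherwise try the real API; 3. fall back to the simulator
        #    if the real API is unavailable.
        try:
            response = self.real_api(api_name, args)
        except Exception:
            response = self.simulator(api_name, args)
        # Store the result so later calls no longer depend on API status.
        self.cache[key] = response
        return response
```

With this shape, an API that goes offline after the first call still returns the same response on later runs, which is the kind of stability the abstract describes.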

Code Implementations: 4 repos
