Is StableToolBench superseded?

StableToolBench (Tool use / function calling): superseded, meaning it is cited as a baseline and beaten by newer methods. 2 papers critique it and 0 beat it on benchmarks; it ranks #18 of 55 most-superseded. Sub-problem: cluster led by ReAct. Newer alternatives in the same sub-problem include GenesisFunc and Think-Augmented Function Calling (TAFC).


Superseded baseline · #18 of 55 most-superseded

StableToolBench

StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models

Tool use / function calling · first seen Mar 12, 2024

superseded — cited as a baseline and beaten by newer methods

2 papers critique it · 0 beat it on benchmarks

What papers say

Verbatim critique sentences, each from a paper that cites StableToolBench as a baseline.

“Most public benchmarks still overlook other enterprise-grade challenges, notably distinguishing among near-duplicate tools, proactively eliciting mandatory arguments, and detecting or preventing tool-call hallucinations, shortcomings our framework is expressly designed to remedy.”
— Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky
“StableToolBench emphasized stability and reproducibility through API simulation and caching mechanisms, highlighting evaluation brittleness in real API-dependent setups”
— From Language to Action in Arabic: Reliable Structured Tool Calling via Data-Centric Fine-Tuning

Newer alternatives

Recent methods in the same sub-problem, not yet superseded in the knowledge base.