Superseded baseline · #18 of 55 most-superseded
StableToolBench
StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models
Tool use / function calling · first seen Mar 12, 2024
superseded — cited as a baseline and beaten by newer methods
2 papers critique it · 0 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites StableToolBench as a baseline.
“Most public benchmarks still overlook other enterprise-grade challenges, notably distinguishing among near-duplicate tools, proactively eliciting mandatory arguments, and detecting or preventing tool-call hallucinations, shortcomings our framework is expressly designed to remedy.”
— Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky
“StableToolBench emphasized stability and reproducibility through API simulation and caching mechanisms, highlighting evaluation brittleness in real API-dependent setups”
— From Language to Action in Arabic: Reliable Structured Tool Calling via Data-Centric Fine-Tuning
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.