ToolACE
ToolACE: Winning the Points of LLM Function Calling
Tool use / function calling · first seen Sep 2, 2024
heavily superseded — a standard baseline that newer methods routinely beat
2 papers critique it · 4 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites ToolACE as a baseline.
“Prior works often rely on annotated or synthetic APIs, which lack reliability and struggle to scale across larger tool sets. These approaches also face limitations in diversity, quality, and coverage.”
— GenesisFunc: Multi-Agent Data Generation for Accurate and Generalizable Function-Calling
“While APIGen~liu2024apigen, ToolACE~liu2024toolace, and DeCRIM~ferraz2024llm produce verified function-call traces from fully specified queries, and works such as Clarify-When-Necessary~zhang2023clarify theorize when to seek clarification without training a model for it, DiaFORGE unifies three mutually reinforcing contributions absent from any single prior work: (i)~disambiguation-centric synthesis that structurally obliges the assistant to navigate near-duplicate API surfaces via injected distractors and a two-phase coercive dialogue protocol; (ii)~reasoning-trace SFT that jointly teaches tool disambiguation and argument solicitation in a single multi-turn curriculum across 3--70~B parameter models; and (iii)~a dynamic agentic evaluation that redeploys fine-tuned models in a live conversational loop with a simulated user, measuring end-to-end goal completion rather than isolated turn accuracy.”
— Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky
Beaten on benchmarks
Head-to-head results where a newer method reports beating ToolACE. Values are copied from the source paper's tables — verify against the cited paper.
- ToolACE-R: Model-aware Iterative Training and Adaptive Refinement for Tool Learning
ToolACE-R (FC) beats ToolACE · Overall Accuracy [Non-Live + Live combined]
86.33 vs 81.78
- ToolACE-DEV: Self-Improving Tool Learning via Decomposition and EVolution
ToolACE-DEV beats ToolACE · Overall [BFCL benchmark overall]
82.44 vs 80.70
- CRITICTOOL: Evaluating Self-Critique Capabilities of Large Language Models in Tool-Calling Error Scenarios
AgentLM-7B beats ToolACE · Overall [Tool-Use-Finetuned Models]
37.1 vs 10.3
- ASA: Activation Steering for Tool-Calling Domain Adaptation
ASA beats ToolACE · Overall First Call Accuracy [NESTFUL evaluation]
41.94 vs 28.76
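To make the gaps above easier to read, the following is a minimal sketch that computes the absolute margin for each reported pair. The `RESULTS` list and the `margin` helper are illustrative names, not part of any cited paper, and the scores are simply the values listed above. Each pair comes from a different paper and evaluation setup, so the margins are comparable only within a row, not across rows.

```python
# Scores copied from the head-to-head entries listed above.
# Labels are the short names of the source papers (illustrative only).
RESULTS = [
    ("ToolACE-R", 86.33, 81.78),
    ("ToolACE-DEV", 82.44, 80.70),
    ("CRITICTOOL", 37.1, 10.3),
    ("ASA", 41.94, 28.76),
]


def margin(newer: float, baseline: float) -> float:
    """Absolute gap, in points, between the newer method and ToolACE."""
    return round(newer - baseline, 2)


for paper, newer, baseline in RESULTS:
    print(f"{paper}: +{margin(newer, baseline)} points over ToolACE")
```

The gaps range from under 2 points (ToolACE-DEV) to more than 25 points (CRITICTOOL), which mostly reflects how different the evaluation settings are rather than a ranking of the newer methods against each other.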
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.