Is ToolACE superseded?

ToolACE (Tool use / function calling) is heavily superseded: a standard baseline that newer methods routinely beat. 2 papers critique it and 4 report beating it on benchmarks, placing it #3 of 55 most-superseded. Sub-problem: cluster led by ReAct. Newer alternatives in the same sub-problem include GenesisFunc and Think-Augmented Function Calling (TAFC).

Heavily superseded · #3 of 55 most-superseded

ToolACE

ToolACE: Winning the Points of LLM Function Calling

Tool use / function calling · first seen Sep 2, 2024

heavily superseded — a standard baseline that newer methods routinely beat

2 papers critique it · 4 beat it on benchmarks

What papers say

Verbatim critique sentences, each from a paper that cites ToolACE as a baseline.

“Prior works often rely on annotated or synthetic APIs, which lack reliability and struggle to scale across larger tool sets. These approaches also face limitations in diversity, quality, and coverage.”
— GenesisFunc: Multi-Agent Data Generation for Accurate and Generalizable Function-Calling
“While APIGen [liu2024apigen], ToolACE [liu2024toolace], and DeCRIM [ferraz2024llm] produce verified function-call traces from fully specified queries, and works such as Clarify-When-Necessary [zhang2023clarify] theorize when to seek clarification without training a model for it, DiaFORGE unifies three mutually reinforcing contributions absent from any single prior work: (i) disambiguation-centric synthesis that structurally obliges the assistant to navigate near-duplicate API surfaces via injected distractors and a two-phase coercive dialogue protocol; (ii) reasoning-trace SFT that jointly teaches tool disambiguation and argument solicitation in a single multi-turn curriculum across 3–70B parameter models; and (iii) a dynamic agentic evaluation that redeploys fine-tuned models in a live conversational loop with a simulated user, measuring end-to-end goal completion rather than isolated turn accuracy.”
— Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky

Beaten on benchmarks

Head-to-head results where a newer method reports beating ToolACE. Values are copied from the source paper's tables — verify against the cited paper.

ToolACE-R (FC) beats ToolACE · Overall Accuracy [Non-Live + Live combined]
86.33 vs 81.78
ToolACE-R: Model-aware Iterative Training and Adaptive Refinement for Tool Learning
ToolACE-DEV beats ToolACE · Overall [BFCL benchmark overall]
82.44 vs 80.70
ToolACE-DEV: Self-Improving Tool Learning via Decomposition and EVolution
AgentLM-7B beats ToolACE · Overall [Tool-Use-Finetuned Models]
37.1 vs 10.3
CRITICTOOL: Evaluating Self-Critique Capabilities of Large Language Models in Tool-Calling Error Scenarios
ASA beats ToolACE · Overall First Call Accuracy [NESTFUL evaluation]
41.94 vs 28.76
ASA: Activation Steering for Tool-Calling Domain Adaptation

Newer alternatives

Recent methods in the same sub-problem, not yet superseded in the knowledge base.