Is ToolLLM superseded?

ToolLLM (Tool use / function calling) is heavily superseded: it is a standard baseline that newer methods routinely beat. 4 papers critique it and 3 beat it on benchmarks, which ranks it #2 of 55 most-superseded. Sub-problem: cluster led by ToolLLM. Newer alternatives in the same sub-problem include the Planner-centric Plan-Execute paradigm.

ToolLLM

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

Tool use / function calling · first seen Jul 31, 2023

heavily superseded — a standard baseline that newer methods routinely beat

4 papers critique it · 3 beat it on benchmarks

What papers say

Verbatim critique sentences, each from a paper that cites ToolLLM as a baseline.

“the fine-tuned models from datasets like ToolLLM [qin2023toolllm], ToolAlpaca [tang2023toolalpaca], and Gorilla [patil2023gorilla] underperform in one (or more) of three key dimensions: (a) Generalizability: While the datasets are generated using diverse sets of APIs (e.g., ToolLLama uses RapidAPIs (https://rapidapi.com/hub), ToolAlpaca uses public APIs (https://github.com/public-apis/public-apis), and Gorilla uses TensorFlow Hub, PyTorch Hub, and Hugging Face Hub), work from [basu2024apiblend] has shown that models trained on these datasets have difficulty generalizing to out-of-domain datasets.”
— Granite-Function Calling Model: Introducing Function Calling Abilities via Multi-task Learning of Granular Tasks
“Unlike our work, these datasets are not conversational and just focus on mapping utterances to API calls, and they do not use intermediate structures (i.e., graphs) to ensure coverage and reduce hallucinations in generated tests.”
— Automated test generation to evaluate tool-augmented LLMs as conversational AI agents
“ToolLLM employs a tree-based scheme to minimize the number of tools required for task execution, yet it requires LLM calls across the entire tool set, making it impractical for edge devices -- where delay and power consumption are critical”
— Less is More: Optimizing Function Calling for LLM Execution on Edge Devices
“Unfortunately, in cases when the entire toolset needs to be enumerated, ToolLLM suffers from increased latency and energy consumption.”
— CarbonCall: Sustainability-Aware Function Calling for Large Language Models on Edge Devices

Beaten on benchmarks

Head-to-head results where a newer method reports beating ToolLLM. Values are copied from the source paper's tables — verify against the cited paper.

ToolPlanner beats ToolLLM · Match Rate Avg. [SFT with FewShot/standard training]
55.8 vs 21.8
ToolPlanner: A Tool Augmented LLM for Multi Granularity Instructions with Path Planning and Feedback
ToolPlanner beats ToolLLM · Pass Rate Avg. [SFT with FewShot/standard training]
84.4 vs 64.2
ToolPlanner: A Tool Augmented LLM for Multi Granularity Instructions with Path Planning and Feedback
ToolPlanner beats ToolLLM · Win Rate Avg. [SFT with FewShot/standard training]
77.8 vs 71.2
ToolPlanner: A Tool Augmented LLM for Multi Granularity Instructions with Path Planning and Feedback
Qwen3-8B (RL) beats ToolLLM · SoPR [I1-Inst.]
67.0 vs 42.7
Beyond ReAct: A Planner-Centric Framework for Complex Tool-Augmented LLM Reasoning
Qwen3-8B (RL) beats ToolLLM · SoWR [I1-Inst.]
57.9 vs 36.2
Beyond ReAct: A Planner-Centric Framework for Complex Tool-Augmented LLM Reasoning
Qwen3-8B (RL) beats ToolLLM · SoPR [I2-Inst.]
60.1 vs 39.9
Beyond ReAct: A Planner-Centric Framework for Complex Tool-Augmented LLM Reasoning
Qwen3-8B (RL) beats ToolLLM · SoWR [I2-Inst.]
58.2 vs 49.1
Beyond ReAct: A Planner-Centric Framework for Complex Tool-Augmented LLM Reasoning
Qwen3-8B (RL) beats ToolLLM · SoPR [I3-Inst.]
61.0 vs 29.8
Beyond ReAct: A Planner-Centric Framework for Complex Tool-Augmented LLM Reasoning
Qwen3-8B (RL) beats ToolLLM · SoWR [I3-Inst.]
55.9 vs 41.0
Beyond ReAct: A Planner-Centric Framework for Complex Tool-Augmented LLM Reasoning
Qwen3-8B (RL) beats ToolLLM · SoPR [Average]
59.8 vs 37.9
Beyond ReAct: A Planner-Centric Framework for Complex Tool-Augmented LLM Reasoning
Qwen3-8B (RL) beats ToolLLM · SoWR [Average]
55.0 vs 39.3
Beyond ReAct: A Planner-Centric Framework for Complex Tool-Augmented LLM Reasoning
Qwen3-8B (RL) beats ToolLLM · SoPR [I2-Cat.]
54.7 vs 40.9
Beyond ReAct: A Planner-Centric Framework for Complex Tool-Augmented LLM Reasoning
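
To make the size of each gap easier to read, the rows above can be turned into absolute margins (newer method minus ToolLLM). The following is a minimal sketch, not part of any cited paper: the numbers are copied from the rows listed on this page, and the names ROWS and margins are hypothetical helpers introduced only for illustration. The margins mix different metrics and evaluation setups, so they are not comparable across papers.

```python
# Reported values copied from the rows above: (method, metric, newer, toolllm).
# ROWS and margins() are illustrative names, not from any cited paper.
ROWS = [
    ("ToolPlanner", "Match Rate Avg. [SFT with FewShot/standard training]", 55.8, 21.8),
    ("ToolPlanner", "Pass Rate Avg. [SFT with FewShot/standard training]", 84.4, 64.2),
    ("ToolPlanner", "Win Rate Avg. [SFT with FewShot/standard training]", 77.8, 71.2),
    ("Qwen3-8B (RL)", "SoPR [I1-Inst.]", 67.0, 42.7),
    ("Qwen3-8B (RL)", "SoWR [I1-Inst.]", 57.9, 36.2),
    ("Qwen3-8B (RL)", "SoPR [I2-Inst.]", 60.1, 39.9),
    ("Qwen3-8B (RL)", "SoWR [I2-Inst.]", 58.2, 49.1),
    ("Qwen3-8B (RL)", "SoPR [I3-Inst.]", 61.0, 29.8),
    ("Qwen3-8B (RL)", "SoWR [I3-Inst.]", 55.9, 41.0),
    ("Qwen3-8B (RL)", "SoPR [Average]", 59.8, 37.9),
    ("Qwen3-8B (RL)", "SoWR [Average]", 55.0, 39.3),
    ("Qwen3-8B (RL)", "SoPR [I2-Cat.]", 54.7, 40.9),
]


def margins(rows):
    """Return (method, metric, margin) tuples, largest margin first.

    The margin is the newer method's reported value minus ToolLLM's,
    rounded to one decimal to match the precision of the listed values.
    """
    out = [(method, metric, round(newer - toolllm, 1))
           for method, metric, newer, toolllm in rows]
    return sorted(out, key=lambda item: item[2], reverse=True)


if __name__ == "__main__":
    for method, metric, margin in margins(ROWS):
        print(f"{method:<14} {metric:<55} +{margin:.1f}")
```

On the values listed here, the widest gap is ToolPlanner's Match Rate Avg. (34.0 points) and the narrowest is its Win Rate Avg. (6.6 points). As the note above says, the underlying values should be verified against the cited papers before relying on these margins.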

Newer alternatives

Recent methods in the same sub-problem, not yet superseded in the knowledge base.

Planner-centric Plan-Execute paradigm · Beyond ReAct: A Planner-Centric Framework for Complex Tool-Augmented LLM Reasoning · Nov 13, 2025