ToolLLM
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs
Tool use / function calling · first seen Jul 31, 2023
heavily superseded — a standard baseline that newer methods routinely beat
4 papers critique it · 3 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites ToolLLM as a baseline.
“the fine-tuned models from datasets like ToolLLM~qin2023toolllm, ToolAlpaca~tang2023toolalpaca, and Gorilla~patil2023gorilla underperform in one (or more) of three key dimensions: (a) Generalizability: While the datasets are generated using diverse sets of APIs (e.g., ToolLLama uses RapidAPIs~{https://rapidapi.com/hub}, ToolAlpaca uses public APIs{https://github.com/public-apis/public-apis}, and Gorilla uses TensorFlow Hub, PyTorch Hub, and Hugging Face Hub), work from~basu2024apiblend has shown that models trained on these datasets have difficulty generalizing to out-of-domain datasets.”
— Granite-Function Calling Model: Introducing Function Calling Abilities via Multi-task Learning of Granular Tasks
“Unlike our work, these datasets are not conversational and just focus on mapping utterances to API calls, and they do not use intermediate structures (i.e., graphs) to ensure coverage and reduce hallucinations in generated tests.”
— Automated test generation to evaluate tool-augmented LLMs as conversational AI agents
“ToolLLM employs a tree-based scheme to minimize the number of tools required for task execution, yet it requires LLM calls across the entire tool set, making it impractical for edge devices -- where delay and power consumption are critical”
— Less is More: Optimizing Function Calling for LLM Execution on Edge Devices
“Unfortunately, in cases when the entire toolset needs to be enumerated, ToolLLM suffers from increased latency and energy consumption.”
— CarbonCall: Sustainability-Aware Function Calling for Large Language Models on Edge Devices
Beaten on benchmarks
Head-to-head results where a newer method reports beating ToolLLM. Values are copied from the source paper's tables — verify against the cited paper.
- ToolPlanner: A Tool Augmented LLM for Multi Granularity Instructions with Path Planning and Feedback
ToolPlanner beats ToolLLM · SFT with FewShot/standard training
Match Rate Avg.: 55.8 vs 21.8
Pass Rate Avg.: 84.4 vs 64.2
Win Rate Avg.: 77.8 vs 71.2
- Beyond ReAct: A Planner-Centric Framework for Complex Tool-Augmented LLM Reasoning
Qwen3-8B (RL) beats ToolLLM
SoPR [I1-Inst.]: 67.0 vs 42.7
SoWR [I1-Inst.]: 57.9 vs 36.2
SoPR [I2-Inst.]: 60.1 vs 39.9
SoWR [I2-Inst.]: 58.2 vs 49.1
SoPR [I3-Inst.]: 61.0 vs 29.8
SoWR [I3-Inst.]: 55.9 vs 41.0
SoPR [Average]: 59.8 vs 37.9
SoWR [Average]: 55.0 vs 39.3
SoPR [I2-Cat.]: 54.7 vs 40.9
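The size of each reported gap can be read off with simple subtraction. The Python sketch below is illustrative only: it transcribes the values listed above and computes the absolute margin in points for each entry. The names `RESULTS`, `margin`, and `margins_by_paper` are hypothetical helpers, not part of any cited paper's code, and margins from different papers use different metrics and setups, so they should not be compared with each other.

```python
# Values transcribed from the entries listed above:
# (paper, metric, newer method's score, ToolLLM's score).
RESULTS = [
    ("ToolPlanner", "Match Rate Avg.", 55.8, 21.8),
    ("ToolPlanner", "Pass Rate Avg.", 84.4, 64.2),
    ("ToolPlanner", "Win Rate Avg.", 77.8, 71.2),
    ("Beyond ReAct", "SoPR [I1-Inst.]", 67.0, 42.7),
    ("Beyond ReAct", "SoWR [I1-Inst.]", 57.9, 36.2),
    ("Beyond ReAct", "SoPR [I2-Inst.]", 60.1, 39.9),
    ("Beyond ReAct", "SoWR [I2-Inst.]", 58.2, 49.1),
    ("Beyond ReAct", "SoPR [I3-Inst.]", 61.0, 29.8),
    ("Beyond ReAct", "SoWR [I3-Inst.]", 55.9, 41.0),
    ("Beyond ReAct", "SoPR [Average]", 59.8, 37.9),
    ("Beyond ReAct", "SoWR [Average]", 55.0, 39.3),
    ("Beyond ReAct", "SoPR [I2-Cat.]", 54.7, 40.9),
]


def margin(newer: float, baseline: float) -> float:
    """Absolute gap in points, rounded to one decimal like the listed values."""
    return round(newer - baseline, 1)


def margins_by_paper(results):
    """Group (metric, margin) pairs under the paper that reported them."""
    grouped = {}
    for paper, metric, newer, baseline in results:
        grouped.setdefault(paper, []).append((metric, margin(newer, baseline)))
    return grouped


if __name__ == "__main__":
    for paper, rows in margins_by_paper(RESULTS).items():
        print(paper)
        for metric, gap in rows:
            print(f"  {metric}: +{gap} points over ToolLLM")
```

Rounding to one decimal avoids floating-point noise in the differences. As the note above says, the underlying values should still be verified against the cited papers before the margins are quoted anywhere.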
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.