Is Gorilla superseded?

Gorilla (Tool use / function calling) is heavily superseded: it is a standard baseline that newer methods routinely beat. 3 papers critique it and 1 beats it on benchmarks, which ranks it #5 of 55 most-superseded. Its sub-problem cluster is led by ToolLLM. Newer alternatives in the same sub-problem include the Planner-centric Plan-Execute paradigm.

Heavily superseded · #5 of 55 most-superseded

Gorilla

Gorilla: Large Language Model Connected with Massive APIs

Tool use / function calling · first seen May 24, 2023

heavily superseded — a standard baseline that newer methods routinely beat

3 papers critique it · 1 beats it on benchmarks

What papers say

Verbatim critique sentences, each from a paper that cites Gorilla as a baseline.

“the fine-tuned models from datasets like ToolLLM [qin2023toolllm], ToolAlpaca [tang2023toolalpaca], and Gorilla [patil2023gorilla] underperform in one (or more) of three key dimensions: (a) Generalizability: While the datasets are generated using diverse sets of APIs (e.g., ToolLLama uses RapidAPIs (https://rapidapi.com/hub), ToolAlpaca uses public APIs (https://github.com/public-apis/public-apis), and Gorilla uses TensorFlow Hub, PyTorch Hub, and Hugging Face Hub), work from [basu2024apiblend] has shown that models trained on these datasets have difficulty generalizing to out-of-domain datasets.”
— Granite-Function Calling Model: Introducing Function Calling Abilities via Multi-task Learning of Granular Tasks
“Unlike our work, these datasets are not conversational and just focus on mapping utterances to API calls, and they do not use intermediate structures (i.e., graphs) to ensure coverage and reduce hallucinations in generated tests.”
— Automated test generation to evaluate tool-augmented LLMs as conversational AI agents
“However, these techniques might lack generalizability across diverse function spaces, limiting their applicability to dynamic and sustainability-aware function calling at the edge.”
— CarbonCall: Sustainability-Aware Function Calling for Large Language Models on Edge Devices

Beaten on benchmarks

Head-to-head results where a newer method reports beating Gorilla. Values are copied from the source paper's tables — verify against the cited paper.

Granite-20B-FunctionCalling beats Gorilla · Func. Match (F1) [ToolLLM-G1 Func. Match]
0.86 vs 0.59
Granite-Function Calling Model: Introducing Function Calling Abilities via Multi-task Learning of Granular Tasks
Granite-20B-FunctionCalling beats Gorilla · LCS [ToolLLM-G1 LCS]
0.85 vs 0.59
Granite-Function Calling Model: Introducing Function Calling Abilities via Multi-task Learning of Granular Tasks
Granite-20B-FunctionCalling beats Gorilla · Exact Score (F1) [ToolLLM-G1 Exact Score]
0.63 vs 0.28
Granite-Function Calling Model: Introducing Function Calling Abilities via Multi-task Learning of Granular Tasks
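
To make the size of these gaps easier to compare, the short sketch below computes the absolute and relative margin for each row. The scores are copied from the list above; the names `RESULTS` and `margins` are illustrative and do not come from the cited paper, and the sketch says nothing about how the metrics themselves are defined.

```python
# Scores copied from the list above:
# (metric, Granite-20B-FunctionCalling, Gorilla), all on ToolLLM-G1.
RESULTS = [
    ("Func. Match (F1)", 0.86, 0.59),
    ("LCS", 0.85, 0.59),
    ("Exact Score (F1)", 0.63, 0.28),
]


def margins(results):
    """Return (metric, absolute gap, ratio) per row, rounded to 2 decimals."""
    rows = []
    for metric, newer, baseline in results:
        gap = round(newer - baseline, 2)    # difference in score points
        ratio = round(newer / baseline, 2)  # newer score as a multiple of the baseline
        rows.append((metric, gap, ratio))
    return rows


if __name__ == "__main__":
    for metric, gap, ratio in margins(RESULTS):
        print(f"{metric:<18} gap={gap:+.2f}  ratio={ratio:.2f}x")
```

By this arithmetic the widest gap is on Exact Score, where the reported Granite score is more than twice the Gorilla score. As with the values themselves, this should be checked against the cited paper's tables.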

Newer alternatives

Recent methods in the same sub-problem, not yet superseded in the knowledge base.

Planner-centric Plan-Execute paradigm · Beyond ReAct: A Planner-Centric Framework for Complex Tool-Augmented LLM Reasoning · Nov 13, 2025