Gorilla
Gorilla: Large Language Model Connected with Massive APIs
Tool use / function calling · first seen May 24, 2023
heavily superseded — a standard baseline that newer methods routinely beat
3 papers critique it · 1 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites Gorilla as a baseline.
“the fine-tuned models from datasets like ToolLLM~qin2023toolllm, ToolAlpaca~tang2023toolalpaca, and Gorilla~patil2023gorilla underperform in one (or more) of three key dimensions: (a) Generalizability: While the datasets are generated using diverse sets of APIs (e.g., ToolLLama uses RapidAPIs~{https://rapidapi.com/hub}, ToolAlpaca uses public APIs{https://github.com/public-apis/public-apis}, and Gorilla uses TensorFlow Hub, PyTorch Hub, and Hugging Face Hub), work from~basu2024apiblend has shown that models trained on these datasets have difficulty generalizing to out-of-domain datasets.”
— Granite-Function Calling Model: Introducing Function Calling Abilities via Multi-task Learning of Granular Tasks
“Unlike our work, these datasets are not conversational and just focus on mapping utterances to API calls, and they do not use intermediate structures (i.e., graphs) to ensure coverage and reduce hallucinations in generated tests.”
— Automated test generation to evaluate tool-augmented LLMs as conversational AI agents
“However, these techniques might lack generalizability across diverse function spaces, limiting their applicability to dynamic and sustainability-aware function calling at the edge.”
— CarbonCall: Sustainability-Aware Function Calling for Large Language Models on Edge Devices
Beaten on benchmarks
Head-to-head results where a newer method reports beating Gorilla. Values are copied from the source paper's tables — verify against the cited paper.
- Granite-Function Calling Model: Introducing Function Calling Abilities via Multi-task Learning of Granular Tasks
Granite-20B-FunctionCalling beats Gorilla · Func. Match (F1) [ToolLLM-G1 Func. Match]
Granite-20B-FunctionCalling 0.86 vs Gorilla 0.59
- Granite-Function Calling Model: Introducing Function Calling Abilities via Multi-task Learning of Granular Tasks
Granite-20B-FunctionCalling beats Gorilla · LCS [ToolLLM-G1 LCS]
Granite-20B-FunctionCalling 0.85 vs Gorilla 0.59
- Granite-Function Calling Model: Introducing Function Calling Abilities via Multi-task Learning of Granular Tasks
Granite-20B-FunctionCalling beats Gorilla · Exact Score (F1) [ToolLLM-G1 Exact Score]
Granite-20B-FunctionCalling 0.63 vs Gorilla 0.28
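To make the size of these gaps easier to compare across metrics, the three results above can be turned into absolute and relative margins. The snippet below is only a small illustrative sketch: the scores are copied from the entries above, while the names `RESULTS`, `margins`, and `SUMMARY` are hypothetical helpers introduced here and are not part of any cited paper or tool. It assumes that higher is better for all three metrics, which is implied by the "beats" wording.

```python
from typing import Dict, List, Tuple

# Scores copied from the entries above, as
# (metric, Granite-20B-FunctionCalling score, Gorilla score).
# Assumption: higher is better for every metric listed.
RESULTS: List[Tuple[str, float, float]] = [
    ("Func. Match (F1)", 0.86, 0.59),
    ("LCS", 0.85, 0.59),
    ("Exact Score (F1)", 0.63, 0.28),
]


def margins(newer: float, baseline: float) -> Tuple[float, float]:
    """Return (absolute gap, relative gain in percent) of newer over baseline."""
    absolute = round(newer - baseline, 2)
    relative = round(100 * (newer - baseline) / baseline, 1)
    return absolute, relative


# Hypothetical summary table: metric name -> (absolute gap, relative gain %).
SUMMARY: Dict[str, Tuple[float, float]] = {
    metric: margins(newer, baseline) for metric, newer, baseline in RESULTS
}

if __name__ == "__main__":
    for metric, (absolute, relative) in SUMMARY.items():
        print(f"{metric}: +{absolute:.2f} absolute, +{relative:.1f}% relative")
```

On these numbers the absolute gaps are 0.27, 0.26, and 0.35. The relative gain is largest on Exact Score, where Gorilla's baseline is lowest. As with the values themselves, any conclusion drawn from these margins should be checked against the cited paper's tables.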
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.