Living systematic review

Tool use / function calling

Teaching LLMs to call external tools and APIs: function calling, tool selection and retrieval, and tool-augmented agents.

52 papers · 79 critique receipts · 186 benchmark results · updated Jun 18, 2026

Most-superseded baselines

Ranked by how many distinct papers critique or beat each method. These are the standard baselines newer work routinely measures against.

1
ReAct · ReAct
ReAct: Synergizing Reasoning and Acting in Language Models
6 papers critique it · 4 beat it on benchmarks
2
ToolLLM · ToolLLM
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs
4 papers critique it · 3 beat it on benchmarks
3
ToolACE · ReAct
ToolACE: Winning the Points of LLM Function Calling
2 papers critique it · 4 beat it on benchmarks
4
API-Bank · ReAct
API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs
5 papers critique it · 0 beat it on benchmarks
5
Gorilla · ToolLLM
Gorilla: Large Language Model Connected with Massive APIs
3 papers critique it · 1 beat it on benchmarks
6
GRPO · GRPO
Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering
2 papers critique it · 1 beat it on benchmarks
7
StepTool · StepTool
2 papers critique it · 1 beat it on benchmarks
8
Toolformer · Toolformer
Toolformer: Language Models Can Teach Themselves to Use Tools
2 papers critique it · 1 beat it on benchmarks
9
APIGen · ReAct
APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets
3 papers critique it · 0 beat it on benchmarks
10
xLAM · ReAct
xLAM: A Family of Large Action Models to Empower AI Agent Systems
0 papers critique it · 2 beat it on benchmarks
11
ART · ART
ART: Automatic multi-step reasoning and tool-use for large language models
1 paper critiques it · 1 beats it on benchmarks
12
ExpeL · ART
ExpeL: LLM Agents Are Experiential Learners
1 paper critiques it · 1 beats it on benchmarks
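
The ordering above can be reproduced with a short sketch. It assumes the rank score is simply the critique count plus the beat count, and that ties keep their listed order. The page does not state how a paper that both critiques and beats a method is counted, so the sum is only a proxy for "distinct papers". The names `Baseline` and `rank_baselines` are illustrative, not part of the site.

```python
from typing import NamedTuple


class Baseline(NamedTuple):
    name: str
    critiques: int  # papers that critique the method (from the list above)
    beats: int      # papers that beat it on benchmarks (from the list above)


# Counts copied from the ranked list, in the order the page shows them.
BASELINES = [
    Baseline("ReAct", 6, 4),
    Baseline("ToolLLM", 4, 3),
    Baseline("ToolACE", 2, 4),
    Baseline("API-Bank", 5, 0),
    Baseline("Gorilla", 3, 1),
    Baseline("GRPO", 2, 1),
    Baseline("StepTool", 2, 1),
    Baseline("Toolformer", 2, 1),
    Baseline("APIGen", 3, 0),
    Baseline("xLAM", 0, 2),
    Baseline("ART", 1, 1),
    Baseline("ExpeL", 1, 1),
]


def rank_baselines(baselines):
    """Sort by critiques + beats, highest first.

    Assumption: the score is the plain sum. Python's sort is stable,
    also with reverse=True, so tied entries keep their input order.
    """
    return sorted(baselines, key=lambda b: b.critiques + b.beats, reverse=True)


if __name__ == "__main__":
    for position, b in enumerate(rank_baselines(BASELINES), start=1):
        print(position, b.name, b.critiques + b.beats)
```

Under that assumption the sorted order matches the list as shown, with ties such as GRPO, StepTool, Toolformer and APIGen left in their listed order.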

Sub-problems

Methods that compete on the same benchmarks cluster into distinct sub-problems.

ReAct · 24 methods

ReAct · ToolACE · API-Bank · APIGen · xLAM · StableToolBench

ToolLLM · 10 methods

ToolLLM · Gorilla · ToolAlpaca · Less-is-More · TinyAgent · ToolPlanner

StepTool · 9 methods

StepTool · ToolRL · CodeAct · Search-R1 · Search-R1 PPO · R2IF

Probe&Prefill · 7 methods

Probe&Prefill · When2Tool / ToolReadable · Tool-identity steering · Tool-identity · NexusRaven · Functionary

AgentAuditor · 6 methods

AgentAuditor · AGrail · GuardAgent · LlamaFirewall · ShieldAgent · ToolSafe

GRPO · 5 methods

GRPO · SAGE · Reflexion · Reinforced Agent · RC-GRPO

Toolformer · 4 methods

Toolformer · CAST · CostBench · ToolAlign

ART · 3 methods

ART · ExpeL · Stepwise Experience Recall (SEER)

ReTool · 4 methods

ReTool · SWiRL · CoCoDA · SPaRK

Mem0 · 4 methods

Mem0 · NLSI · PEToolLLM · PRefine
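
One plausible reading of "cluster" here is connected components: two methods land in the same sub-problem if they are linked through a chain of shared benchmarks. The page does not say which algorithm it uses, so the sketch below is an assumption. The function name `cluster_by_shared_benchmark` and the method and benchmark identifiers in the example are hypothetical placeholders, not data from the review.

```python
from collections import defaultdict


def cluster_by_shared_benchmark(results):
    """Group methods that are linked by shared benchmarks.

    `results` maps a method name to the set of benchmark ids it reports on.
    Returns a list of clusters (sorted name lists), largest first.
    Assumption: clustering means connected components, found with union-find.
    """
    parent = {method: method for method in results}

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_method_on = {}  # benchmark id -> first method seen on it
    for method, benchmarks in results.items():
        for bench in benchmarks:
            if bench in first_method_on:
                parent[find(method)] = find(first_method_on[bench])
            else:
                first_method_on[bench] = method

    groups = defaultdict(list)
    for method in results:
        groups[find(method)].append(method)
    return sorted((sorted(g) for g in groups.values()),
                  key=lambda g: (-len(g), g))


if __name__ == "__main__":
    # Placeholder data only: these are not real methods or benchmarks.
    example = {
        "method_a": {"bench_1"},
        "method_b": {"bench_1", "bench_2"},
        "method_c": {"bench_2"},
        "method_d": {"bench_3"},
    }
    print(cluster_by_shared_benchmark(example))
```

In the placeholder data, method_a and method_c share no benchmark directly but both overlap with method_b, so all three fall into one cluster while method_d stands alone.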

The frontier

Recent methods not yet superseded in the knowledge base.