Living systematic review
Tool use / function calling
Teaching LLMs to call external tools and APIs — function-calling, tool selection/retrieval, and tool-augmented agents.
52 papers · 79 critique receipts · 186 benchmark results · updated Jun 18, 2026
Most-superseded baselines
Ranked by how many distinct papers critique or beat each method. These are the standard baselines newer work routinely measures against.
- 1. ReAct · ReAct · ReAct: Synergizing Reasoning and Acting in Language Models
6 papers critique it · 4 beat it on benchmarks
- 2. ToolLLM · ToolLLM · ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs
4 papers critique it · 3 beat it on benchmarks
- 3. ToolACE · ReAct · ToolACE: Winning the Points of LLM Function Calling
2 papers critique it · 4 beat it on benchmarks
- 4. API-Bank · ReAct · API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs
5 papers critique it · 0 beat it on benchmarks
- 5. Gorilla · ToolLLM · Gorilla: Large Language Model Connected with Massive APIs
3 papers critique it · 1 beat it on benchmarks
- 6. GRPO · GRPO · Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering
2 papers critique it · 1 beat it on benchmarks
- 8. Toolformer · Toolformer · Toolformer: Language Models Can Teach Themselves to Use Tools
2 papers critique it · 1 beat it on benchmarks
- 9. APIGen · ReAct · APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets
3 papers critique it · 0 beat it on benchmarks
- 10. xLAM · ReAct · xLAM: A Family of Large Action Models to Empower AI Agent Systems
0 papers critique it · 2 beat it on benchmarks
- 11. ART · ART · ART: Automatic multi-step reasoning and tool-use for large language models
1 paper critiques it · 1 beats it on benchmarks
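As a rough illustration of the ranking rule, the sketch below sorts the listed methods by the sum of their critique and beat counts. This is an assumption: the page ranks by distinct papers, which could be lower than the sum if one paper both critiques and beats a method. The function name `supersession_score` is hypothetical. The summed counts are at least consistent with the order shown above.

```python
# (method, papers critiquing it, papers beating it), in the order listed above.
listed = [
    ("ReAct", 6, 4),
    ("ToolLLM", 4, 3),
    ("ToolACE", 2, 4),
    ("API-Bank", 5, 0),
    ("Gorilla", 3, 1),
    ("GRPO", 2, 1),
    ("Toolformer", 2, 1),
    ("APIGen", 3, 0),
    ("xLAM", 0, 2),
    ("ART", 1, 1),
]


def supersession_score(critiques: int, beats: int) -> int:
    """Hypothetical score: assumes critiquing and beating papers do not overlap."""
    return critiques + beats


# sorted() is stable, so methods with equal scores keep their listed order.
ranked = sorted(
    listed,
    key=lambda row: supersession_score(row[1], row[2]),
    reverse=True,
)

for name, critiques, beats in ranked:
    print(f"{name}: {supersession_score(critiques, beats)}")
```

Ties (for example GRPO, Toolformer, and APIGen) are left in their listed order because the source does not state a tie-breaking rule.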
Sub-problems
Methods that compete on the same benchmarks cluster into distinct sub-problems.
ToolLLM · 10 methods
ToolLLM · Gorilla · ToolAlpaca · Less-is-More · TinyAgent · ToolPlanner
Probe&Prefill · 7 methods
Probe&Prefill · When2Tool / ToolReadable · Tool-identity steering · Tool-identity · NexusRaven · Functionary
AgentAuditor · 6 methods
AgentAuditor · AGrail · GuardAgent · LlamaFirewall · ShieldAgent · ToolSafe
GRPO · 5 methods
GRPO · SAGE · Reflexion · Reinforced Agent · RC-GRPO
Toolformer · 4 methods
Toolformer · CAST · CostBench · ToolAlign
ART · 3 methods
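One plausible way to form such clusters, sketched here under assumptions, is to treat two methods as linked whenever they report results on the same benchmark and then take connected components. The page does not state its actual clustering procedure; the function name `cluster_by_shared_benchmarks` and the placeholder method and benchmark names are hypothetical.

```python
from collections import defaultdict


def cluster_by_shared_benchmarks(results):
    """Group methods into connected components linked by shared benchmarks.

    results: iterable of (method, benchmark) pairs.
    Returns a list of sets of method names, largest cluster first.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    first_on_benchmark = {}
    for method, benchmark in results:
        find(method)  # register the method even if it shares no benchmark
        if benchmark in first_on_benchmark:
            union(method, first_on_benchmark[benchmark])
        else:
            first_on_benchmark[benchmark] = method

    clusters = defaultdict(set)
    for method in list(parent):
        clusters[find(method)].add(method)
    return sorted(clusters.values(), key=lambda c: (-len(c), sorted(c)))


# Placeholder data, not taken from the review.
example_results = [
    ("method_a", "bench_1"),
    ("method_b", "bench_1"),
    ("method_b", "bench_2"),
    ("method_c", "bench_2"),
    ("method_d", "bench_3"),
]
example_clusters = cluster_by_shared_benchmarks(example_results)
```

In this placeholder data, `method_a` and `method_c` never share a benchmark directly but land in the same cluster through `method_b`, while `method_d` forms its own cluster.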
The frontier
Recent methods not yet superseded in the knowledge base.
- VitalAgent · VitalAgent: A Tool-Augmented Agent for Reactive and Proactive Physiological Monitoring over Wearable Health Data · May 28, 2026
- May 27, 2026
- May 20, 2026
- May 14, 2026
- May 8, 2026
- May 8, 2026
- Apr 29, 2026
- R2IF · R2IF: Aligning Reasoning with Decisions via Composite Rewards for Interpretable LLM Function Calling · Apr 22, 2026
- FHA · Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models · Apr 22, 2026
- Apr 20, 2026
- AlphaQuanter · AlphaQuanter: An End-to-End Tool-Augmented Agentic Reinforcement Learning Framework for Stock Trading · Apr 19, 2026
- Apr 10, 2026