Is APIGen superseded?

APIGen (Tool use / function calling): superseded, that is, cited as a baseline and beaten by newer methods. 3 papers critique it and 0 beat it on benchmarks, which ranks it #9 of 55 most-superseded methods. Sub-problem: cluster led by ReAct. Newer alternatives in the same sub-problem include GenesisFunc and Think-Augmented Function Calling (TAFC).

APIGen

APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets

Tool use / function calling · first seen Jun 26, 2024

What papers say

Verbatim critique sentences, each from a paper that cites APIGen as a baseline.

“This method employs a rigorous verification process to improve data quality but is limited in scope, focusing predominantly on single-turn function-calling scenarios.”
— Facilitating Multi-turn Function Calling for LLMs via Compositional Instruction Tuning
“While APIGen [liu2024apigen], ToolACE [liu2024toolace], and DeCRIM [ferraz2024llm] produce verified function-call traces from fully specified queries, and works such as Clarify-When-Necessary [zhang2023clarify] theorize when to seek clarification without training a model for it, DiaFORGE unifies three mutually reinforcing contributions absent from any single prior work: (i) disambiguation-centric synthesis that structurally obliges the assistant to navigate near-duplicate API surfaces via injected distractors and a two-phase coercive dialogue protocol; (ii) reasoning-trace SFT that jointly teaches tool disambiguation and argument solicitation in a single multi-turn curriculum across 3-70B parameter models; and (iii) a dynamic agentic evaluation that redeploys fine-tuned models in a live conversational loop with a simulated user, measuring end-to-end goal completion rather than isolated turn accuracy.”
— Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky
“Unlike our work, these datasets are not conversational and just focus on mapping utterances to API calls, and they do not use intermediate structures (i.e., graphs) to ensure coverage and reduce hallucinations in generated tests.”
— Automated test generation to evaluate tool-augmented LLMs as conversational AI agents

Newer alternatives

Recent methods in the same sub-problem, not yet superseded in the knowledge base.