Is StepTool superseded?

StepTool (tool use / function calling) is superseded: it is cited as a baseline and beaten by newer methods. Two papers critique it and one beats it on benchmarks, which ranks it #7 of 55 most-superseded methods. StepTool leads its sub-problem cluster; newer alternatives in the same sub-problem include CARL and R2IF.

Superseded baseline · #7 of 55 most-superseded

StepTool

Tool use / function calling

superseded — cited as a baseline and beaten by newer methods

2 papers critique it · 1 beat it on benchmarks

What papers say

Verbatim critique sentences, each from a paper that cites StepTool as a baseline.

“StepTool [yu2024steptool] scores each step with GPT-4 during PPO, but the external judge can only score calls that were made, not penalize calls that should not have been, and its knowledge may diverge from the student's.”
— Knowing When to Ask: Segment-Level Credit Assignment for LLM Tool Use
“the acquisition of process rewards heavily relies on GPT-based annotations”
— CodeTool: Enhancing Programmatic Tool Invocation of LLMs via Process Supervision

Beaten on benchmarks

Head-to-head results where a newer method reports beating StepTool. Values are copied from the source paper's tables — verify against the cited paper.

CodeTool beats StepTool · SoPR (Solvable Pass Rate) [Qwen2.5-Coder-7B]
69.75 (CodeTool) vs 44.02 (StepTool)
Source: CodeTool: Enhancing Programmatic Tool Invocation of LLMs via Process Supervision
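
To put the gap in perspective, the two SoPR values can be turned into an absolute and a relative gain. This is a minimal sketch: the `gain` helper and the variable names are illustrative and do not come from either paper, and the two scores are the ones listed above, so they should still be checked against the cited paper's tables.

```python
# Scores copied from the head-to-head entry above (SoPR, Qwen2.5-Coder-7B).
# Variable and function names are hypothetical, chosen for this example only.
codetool_sopr = 69.75
steptool_sopr = 44.02


def gain(new: float, old: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = new - old
    relative = 100.0 * absolute / old
    return absolute, relative


absolute_gain, relative_gain = gain(codetool_sopr, steptool_sopr)
print(f"absolute: {absolute_gain:.2f} points, relative: {relative_gain:.1f}%")
```

On these numbers the difference is roughly 25.7 points, or about 58% relative to StepTool's score. It is a single metric on a single base model, so it says nothing about other benchmarks or model sizes.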

Newer alternatives

Recent methods in the same sub-problem, not yet superseded in the knowledge base.