On Effectiveness and Efficiency of Agentic Tool-calling and RL Training
For researchers and practitioners developing LLM agents, this work highlights critical flaws in current evaluation practices and offers efficiency improvements for RL training.
This paper shows that tool-calling evaluation of LLM agents is highly sensitive to undocumented implementation choices, which makes leaderboard rankings unreliable. It also identifies computational waste in RL-based training and proposes two techniques that achieve substantial wall-clock speedup without degrading performance.
Tool-calling is a central component of modern large language model (LLM) agents, equipping them with skills beyond their parametric knowledge. This paper studies tool-calling along two complementary axes: effectiveness, i.e., how this capability is measured, and efficiency, i.e., how it is learned. On effectiveness, we systematically analyze tool-calling evaluation pipelines and show that results can be highly sensitive to seemingly minor, often undocumented implementation choices, including the random seed, the system prompt, multi-turn template construction, and how prior interaction and reasoning history is carried forward. These choices can lead to substantial differences in reported performance, especially in multi-turn settings; without rigorous standardization, leaderboard rankings are unreliable. On efficiency, we examine standard reinforcement learning (RL) for tool-calling and identify two sources of computational waste: (i) during rollouts, many prompts produce no learning signal, and (ii) during policy updates, optimization incurs high computational cost. Guided by these findings, we introduce two techniques that accelerate RL-based tool-calling training, achieving substantial wall-clock speedup without degrading performance.
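To make waste source (i) concrete, the sketch below shows one common way a prompt can yield no learning signal. It assumes a group-relative (GRPO-style) advantage, in which each rollout's reward is normalized against the other rollouts for the same prompt; the abstract does not name the RL algorithm, so this setup is an assumption. The reward values, prompt names, and helper functions are hypothetical, and the code illustrates the problem only, not the two techniques the paper proposes.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps).

    Assumed GRPO-style formulation; not confirmed by the abstract.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def has_learning_signal(rewards):
    """A prompt contributes a nonzero advantage only if its rewards differ."""
    return len(set(rewards)) > 1


# Hypothetical rewards per prompt (1.0 = correct tool call, 0.0 = incorrect).
rollout_rewards = {
    "prompt_a": [1.0, 1.0, 1.0, 1.0],  # always solved: all advantages are zero
    "prompt_b": [0.0, 0.0, 0.0, 0.0],  # never solved: all advantages are zero
    "prompt_c": [1.0, 0.0, 0.0, 1.0],  # mixed outcomes: nonzero advantages
}

informative = [p for p, r in rollout_rewards.items() if has_learning_signal(r)]
wasted_fraction = 1 - len(informative) / len(rollout_rewards)

for prompt, rewards in rollout_rewards.items():
    print(prompt, group_advantages(rewards))
print("informative prompts:", informative)
print("fraction of prompts with no signal:", round(wasted_fraction, 2))
```

In this toy batch, two of the three prompts consume rollout compute yet contribute zero advantage to the policy update, which is the kind of waste the abstract describes.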