ToolRLA: Multiplicative Reward Decomposition for Tool-Integrated Agents
ToolRLA addresses the reliability and compliance of tool-using AI agents in high-stakes domains such as finance, and represents a strong incremental advance in reward design.
The paper tackles the challenge of aligning tool-integrated agents for high-stakes, domain-specific deployment by introducing ToolRLA, a post-training pipeline with a fine-grained multiplicative reward function. Deployed on a financial advisory copilot, it reports a 47% relative improvement in task completion rate (62% to 91%), a 63% reduction in tool invocation errors (38% to 14%), and a 93% reduction in regulatory violations (12% to 0.8%).
Tool-integrated agents that interleave reasoning with API calls are promising for complex tasks, yet aligning them for high-stakes, domain-specific deployment remains challenging: existing reinforcement learning approaches rely on coarse binary rewards that cannot distinguish tool selection errors from malformed parameters. We present ToolRLA, a three-stage post-training pipeline (SFT $\rightarrow$ GRPO $\rightarrow$ DPO) for domain-specific tool agents. The core contribution is a fine-grained reward function with a multiplicative correctness decomposition spanning four dimensions (format validity, tool selection, parameter accuracy, and regulatory compliance) that encodes domain priority orderings as inductive biases in the reward landscape. Over three months of deployment on a financial advisory copilot (80+ advisors, 1,200+ daily queries), ToolRLA achieves a 47\% relative improvement in task completion rate ($62\%\rightarrow91\%$), a 63\% reduction in tool invocation errors ($38\%\rightarrow14\%$), and a 93\% reduction in regulatory violations ($12\%\rightarrow0.8\%$), all within sub-2-second latency. Ablation studies show that the multiplicative reward design accounts for 7 percentage points of improvement over additive alternatives. Generalization is further validated on ToolBench and API-Bank.
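To make the contrast between multiplicative and additive rewards concrete, the following is a minimal sketch, not the paper's actual reward function. The abstract does not give the formula, so everything here is an assumption for illustration: the `StepScores` schema, the scores in [0, 1], the plain product, and the equal-weight average used as the additive baseline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepScores:
    """Per-dimension scores in [0, 1] for one tool call (hypothetical schema)."""
    format_valid: float     # output parses as a well-formed tool call
    tool_correct: float     # the expected tool was selected
    params_accurate: float  # fraction of parameters matching the reference
    compliant: float        # no regulatory rule was violated


def multiplicative_reward(s: StepScores) -> float:
    """Assumed form: a plain product, so any zero factor zeroes the reward."""
    return s.format_valid * s.tool_correct * s.params_accurate * s.compliant


def additive_reward(s: StepScores) -> float:
    """Equal-weight average, included only as a contrast."""
    total = s.format_valid + s.tool_correct + s.params_accurate + s.compliant
    return total / 4


# Wrong tool, but otherwise well-formed, fully parameterized, and compliant.
wrong_tool = StepScores(1.0, 0.0, 1.0, 1.0)
# Right tool, with half of the parameters correct.
partial_params = StepScores(1.0, 1.0, 0.5, 1.0)

for name, scores in [("wrong_tool", wrong_tool), ("partial_params", partial_params)]:
    print(name, multiplicative_reward(scores), additive_reward(scores))
```

Under these assumptions the additive form still gives the wrong-tool call most of the available reward, while the product gives it none and keeps partial credit for the call with the right tool. This is one way a reward can express a priority ordering in which selecting the wrong tool outweighs any downstream correctness.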