AI | Dec 18, 2025

ToolForge: A Data Synthesis Pipeline for Multi-Hop Search without Real-World APIs

Hao Chen, Zhexin Hu, Jiajun Chai, Haocheng Yang, Hang He, Xiaohan Wang, Wei Lin, Luhang Wang, Guojun Yin, Zhuofeng Zhao

arXiv:2512.16149v1 | 14.76 citations | h-index: 6 | Has Code

Originality: Incremental advance

AI Analysis

The work addresses two weaknesses of synthetic data generation for tool learning: its high cost and its lack of multi-hop reasoning. It appears incremental, however, amounting to an improved pipeline for a specific domain.

The paper tackles the problem of training LLMs for tool invocation without costly real API calls. It introduces ToolForge, a data synthesis pipeline that generates multi-hop search data using virtual tools; an 8B-parameter model trained on this data outperforms GPT-4o on multiple benchmarks.

Training LLMs to invoke tools and leverage retrieved information necessitates high-quality, diverse data. However, existing pipelines for synthetic data generation often rely on tens of thousands of real API calls to enhance generalization, incurring prohibitive costs while lacking multi-hop reasoning and self-reflection. To address these limitations, we introduce ToolForge, an automated synthesis framework that achieves strong real-world tool-calling performance by constructing only a small number of virtual tools, eliminating the need for real API calls. ToolForge leverages a (question, golden context, answer) triple to synthesize large-scale tool-learning data specifically designed for multi-hop search scenarios, further enriching the generated data through multi-hop reasoning and self-reflection mechanisms. To ensure data fidelity, we employ a Multi-Layer Validation Framework that integrates both rule-based and model-based assessments. Empirical results show that a model with only 8B parameters, when trained on our synthesized data, outperforms GPT-4o on multiple benchmarks. Our code and dataset are publicly available at https://github.com/Buycar-arb/ToolForge.
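To make the abstract's ingredients concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy word-overlap "virtual tool" over the golden context and a purely rule-based validation layer; the model-based layer mentioned in the abstract is omitted. All names (`Triple`, `ToolCall`, `Sample`, `virtual_search`, `synthesize`, `rule_check`) and the example data are hypothetical and do not come from the ToolForge repository.

```python
from dataclasses import dataclass, field


@dataclass
class Triple:
    """A (question, golden context, answer) seed record."""
    question: str
    golden_context: list[str]
    answer: str


@dataclass
class ToolCall:
    tool: str    # name of the virtual tool (hypothetical)
    query: str   # search query issued at this hop
    result: str  # passage returned by the virtual tool


@dataclass
class Sample:
    question: str
    calls: list[ToolCall] = field(default_factory=list)
    final_answer: str = ""


def virtual_search(query: str, corpus: list[str]) -> str:
    """Toy virtual tool: return the passage sharing the most words with the query.

    No real API is called; the corpus is just the golden context.
    """
    words = set(query.lower().split())
    return max(corpus, key=lambda p: len(words & set(p.lower().split())))


def synthesize(triple: Triple, hop_queries: list[str]) -> Sample:
    """Build a multi-hop sample by answering each hop query from the golden context."""
    sample = Sample(question=triple.question)
    for q in hop_queries:
        passage = virtual_search(q, triple.golden_context)
        sample.calls.append(ToolCall("virtual_search", q, passage))
    sample.final_answer = triple.answer
    return sample


def rule_check(sample: Sample, triple: Triple, allowed_tools: set[str]) -> list[str]:
    """Rule-based validation layer (assumed rules): return a list of violations."""
    errors = []
    if len(sample.calls) < 2:
        errors.append("not multi-hop: fewer than two tool calls")
    for call in sample.calls:
        if call.tool not in allowed_tools:
            errors.append(f"unknown tool: {call.tool}")
        if call.result not in triple.golden_context:
            errors.append(f"result not grounded in golden context: {call.query}")
    if sample.final_answer.strip().lower() != triple.answer.strip().lower():
        errors.append("final answer does not match the reference answer")
    return errors


# Usage with made-up example data.
triple = Triple(
    question="In which country was the author of Dune born?",
    golden_context=[
        "Dune is a novel written by Frank Herbert.",
        "Frank Herbert was born in Tacoma, Washington, United States.",
    ],
    answer="United States",
)
sample = synthesize(triple, ["who wrote Dune", "where was Frank Herbert born"])
errors = rule_check(sample, triple, allowed_tools={"virtual_search"})
print(errors)
```

In a fuller pipeline, samples that pass the rule layer would then go to a model-based judge; that stage is left out here because the abstract does not specify how it works.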
