CL, Feb 10

AgentSkiller: Scaling Generalist Agent Intelligence through Semantically Integrated Cross-Domain Data Synthesis

Zexu Sun, Bokai Ji, Hengyi Cai, Shuaiqiang Wang, Lei Wang, Guangxia Li, Xu Chen

arXiv:2602.09372v12.13 citations, h-index: 3

Originality: Incremental advance

AI Analysis

This work addresses data scarcity for researchers and developers who want to scale generalist AI agents. The advance is incremental, since it builds on existing data synthesis methods.

The paper tackles the scarcity of high-quality, long-horizon data that limits the scaling of generalist LLM agents. It proposes AgentSkiller, a fully automated framework that synthesizes multi-turn interaction data across semantically linked domains. Models trained on the resulting dataset show significant improvements in function calling over baselines, especially at larger model sizes.

Large Language Model agents demonstrate potential in solving real-world problems via tools, yet generalist intelligence is bottlenecked by scarce high-quality, long-horizon data. Existing methods collect privacy-constrained API logs or generate scripted interactions lacking diversity, which struggle to produce data requisite for scaling capabilities. We propose AgentSkiller, a fully automated framework synthesizing multi-turn interaction data across realistic, semantically linked domains. It employs a DAG-based architecture with explicit state transitions to ensure determinism and recoverability. The pipeline builds a domain ontology and Person-Centric Entity Graph, defines tool interfaces via Service Blueprints for Model Context Protocol servers, and populates environments with consistent databases and strict Domain Policies. A cross-domain fusion mechanism links services to simulate complex tasks. Finally, the pipeline creates user tasks by verifying solution paths, filtering via execution-based validation, and generating queries using a Persona-based Simulator for automated rollout. This produces reliable environments with clear state changes. To demonstrate effectiveness, we synthesized $\approx$ 11K interaction samples; experimental results indicate that models trained on this dataset achieve significant improvements on function calling over baselines, particularly in larger parameter regimes.
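The abstract's "DAG-based architecture with explicit state transitions" can be illustrated with a minimal sketch. This is not the authors' implementation: the stage names, the dependency edges, the `run_pipeline` function, and the `"running"`/`"done"`/`"failed"` state labels are all assumptions chosen to mirror the steps listed in the abstract. The sketch only shows how a fixed topological order gives a deterministic run and how a recorded per-stage state lets a failed run resume without redoing finished stages.

```python
from graphlib import TopologicalSorter

# Hypothetical stages and dependencies, loosely following the abstract.
# Each key maps to the stages that must finish before it can run.
STAGES = {
    "ontology": (),
    "entity_graph": ("ontology",),
    "blueprints": ("entity_graph",),
    "databases": ("blueprints",),
    "policies": ("blueprints",),
    "fusion": ("databases", "policies"),
    "tasks": ("fusion",),
}


def run_pipeline(stages, handlers, state=None):
    """Run stages in dependency order, skipping any already marked "done".

    `state` maps stage name -> "running" | "done" | "failed". Passing the
    state returned by an earlier, failed run resumes from the failure point.
    Returns the updated state and the list of stages executed in this call.
    """
    state = dict(state or {})
    executed = []
    for name in TopologicalSorter(stages).static_order():
        if state.get(name) == "done":
            continue  # recoverability: finished work is not repeated
        state[name] = "running"  # explicit transition before execution
        try:
            handlers[name]()
        except Exception:
            state[name] = "failed"
            return state, executed  # stop; downstream stages stay unrun
        state[name] = "done"
        executed.append(name)
    return state, executed
```

In this sketch the state dictionary is kept in memory; writing it to disk after each transition (for example as JSON) would be one way to survive a process crash, though the abstract does not say how the framework persists state.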
