EnterpriseBench CoreCraft: Training Generalizable Agents on High-Fidelity RL Environments
This work addresses the challenge of training generalizable AI agents for enterprise applications. It is an incremental advance, with measured gains in held-out task performance and in transfer to out-of-distribution benchmarks.
The paper tackles the problem of training AI agents that generalize beyond their training distribution. It introduces CoreCraft, a high-fidelity enterprise simulation of a customer support organization, and shows that training GLM 4.6 with GRPO and adaptive clipping raises the task pass rate on held-out tasks from 25.37% to 36.76%. The gains also transfer to out-of-distribution benchmarks, for example +4.5% on BFCL Parallel.
We show that training AI agents on high-fidelity reinforcement learning environments produces capabilities that generalize beyond the training distribution. We introduce CoreCraft, the first environment in EnterpriseBench, Surge AI's suite of agentic RL environments. CoreCraft is a fully operational enterprise simulation of a customer support organization, comprising over 2,500 entities across 14 entity types with 23 unique tools, designed to measure whether AI agents can perform the multi-step, domain-specific work that real jobs demand. Frontier models such as GPT-5.2 and Claude Opus 4.6 solve fewer than 30% of tasks when all expert-authored rubric criteria must be satisfied. Using this environment, we train GLM 4.6 with Group Relative Policy Optimization (GRPO) and adaptive clipping. After a single epoch of training, the model improves from 25.37% to 36.76% task pass rate on held-out evaluation tasks. More importantly, these gains transfer to out-of-distribution benchmarks: +4.5% on BFCL Parallel, +7.4% on Tau2-Bench Retail, and +6.8% on Tool Decathlon (Pass@1). Three environment properties are consistent with the observed transfer: task-centric world building that optimizes for diverse, challenging tasks; expert-authored rubrics enabling reliable reward computation; and enterprise workflows that reflect realistic professional patterns. Our results suggest that environment quality, diversity, and realism are key factors enabling generalizable agent capabilities.
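To make the two quantities in the abstract concrete, the following is a minimal sketch of a strict rubric-based pass rate and of the group-relative advantage that gives GRPO its name. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the reward is assumed to be binary (1 if every rubric criterion is met, else 0), and the advantage uses the commonly described mean and standard deviation normalization within a group of rollouts for the same task. The adaptive clipping scheme is not described in this section, so it is not sketched here.

```python
from statistics import mean, pstdev


def task_passed(criteria_results):
    """Strict scoring: a task passes only if every rubric criterion is met."""
    return all(criteria_results)


def pass_rate(tasks):
    """Fraction of tasks whose rubric criteria are all satisfied."""
    return sum(task_passed(t) for t in tasks) / len(tasks)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own group.

    Assumption: rewards come from several rollouts of the same task, and the
    advantage is (reward - group mean) / (group std + eps). The population
    standard deviation is used here; some implementations use the sample one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rubric outcomes for four tasks (one list of criteria per task).
tasks = [[True, True], [True, False], [False], [True]]
print(pass_rate(tasks))

# Hypothetical binary rewards for four rollouts of a single task.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Under this normalization, a group in which every rollout receives the same reward yields zero advantage for all rollouts. That is one reason task difficulty matters for training signal: tasks that are always solved or never solved contribute no gradient.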