MA · LG · May 9

Beyond the All-in-One Agent: Benchmarking Role-Specialized Multi-Agent Collaboration in Enterprise Workflows

Tao Yu, Hao Wang, Changyu Li, Shenghua Chai, Minghui Zhang, Zhongtian Luo, Yuxuan Zhou, Haopeng Jin, Zhaolu Kang, Jiabing Yang, YiFan Zhang, Xinming Wang

arXiv:2605.08761 · 99.0

Predicted impact: top 1% in MA · last 90 days
Originality: Incremental advance

AI Analysis

For researchers and developers building LLM agent systems for enterprise environments, this benchmark fills a gap by testing realistic multi-agent collaboration constraints not covered by existing benchmarks.

The paper introduces EntCollabBench, a benchmark for evaluating multi-agent collaboration in enterprise workflows with role specialization, access control, and policy-based approvals. Experiments show current LLM agents struggle with delegation, context transfer, and decision commitment in realistic organizational settings.

Large language model (LLM) agents are increasingly expected to operate in enterprise environments, where work is distributed across specialized roles, permission-controlled systems, and cross-departmental procedures. However, existing enterprise benchmarks largely evaluate single agents with broad tool access, while existing multi-agent benchmarks rarely capture realistic enterprise constraints such as role specialization, access control, stateful business systems, and policy-based approvals. We introduce EntCollabBench, a benchmark for evaluating enterprise multi-agent collaboration. EntCollabBench simulates a permission-isolated organization with 11 role-specialized agents across six departments and contains two evaluation subsets: a Workflow subset, where agents collaboratively modify enterprise system states, and an Approval subset, where agents make policy-grounded decisions. Evaluation is based on execution traces, database state verification, and deterministic policy adjudication rather than natural-language response judging. Experiments with representative LLM agents show that current models still struggle with end-to-end enterprise collaboration, especially in delegation, context transfer, parameter grounding, workflow closure, and decision commitment. EntCollabBench provides a reproducible testbed for measuring and improving agent systems intended for realistic organizational environments.
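To make the evaluation style described in the abstract concrete, the following is a minimal Python sketch of permission-isolated tool calls, an execution trace, a deterministic policy check, and a final database state comparison. It is not EntCollabBench's actual code or API: every name here (`Agent`, `call_tool`, `adjudicate`, `verify_state`, the purchase-order schema, and the 1000 approval limit) is a hypothetical chosen for illustration.

```python
from dataclasses import dataclass


class PermissionDenied(Exception):
    """Raised when an agent calls a tool outside its role (hypothetical)."""


@dataclass(frozen=True)
class Agent:
    name: str
    allowed_tools: frozenset  # tools this role may call


def create_po(db, po_id, amount):
    # Hypothetical business tool: insert a pending purchase order.
    db["purchase_orders"][po_id] = {"amount": amount, "status": "pending"}


def set_po_status(db, po_id, status):
    # Hypothetical business tool: record the approval decision.
    db["purchase_orders"][po_id]["status"] = status


TOOLS = {"create_po": create_po, "set_po_status": set_po_status}


def call_tool(agent, tool, db, trace, **params):
    # Every attempt is logged, so the trace also captures denied calls.
    allowed = tool in agent.allowed_tools
    trace.append({"agent": agent.name, "tool": tool, "allowed": allowed})
    if not allowed:
        raise PermissionDenied(f"{agent.name} may not call {tool}")
    TOOLS[tool](db, **params)


def adjudicate(amount, limit=1000):
    # Deterministic policy rule (assumed): approve amounts up to the limit.
    return "approved" if amount <= limit else "rejected"


def verify_state(db, expected):
    # Score the episode by comparing the final state, not the agents' replies.
    return db["purchase_orders"] == expected


def run_episode():
    db = {"purchase_orders": {}}
    trace = []
    requester = Agent("procurement", frozenset({"create_po"}))
    approver = Agent("finance", frozenset({"set_po_status"}))

    call_tool(requester, "create_po", db, trace, po_id="PO-1", amount=800)
    decision = adjudicate(db["purchase_orders"]["PO-1"]["amount"])
    call_tool(approver, "set_po_status", db, trace, po_id="PO-1", status=decision)

    expected = {"PO-1": {"amount": 800, "status": "approved"}}
    return db, trace, verify_state(db, expected)
```

In this sketch the procurement role cannot close its own request: it has to hand off to the finance role, and the episode only passes if the final database state matches the expected one. That mirrors, in miniature, the delegation and workflow-closure behaviors the abstract says current models struggle with.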
