AI · Apr 21

AutomationBench

arXiv:2604.18934 · 82.4 · h-index: 1
Predicted impact: top 32% in AI · last 90 days
Originality: Incremental advance
AI Analysis

For AI researchers and businesses, this benchmark provides a realistic measure of agentic capabilities needed for real-world business workflows, revealing that current models are far from adequate.

AutomationBench introduces a benchmark for evaluating AI agents on cross-application workflow orchestration via REST APIs, requiring endpoint discovery, policy adherence, and multi-system data writing. The best frontier models score below 10%, highlighting a significant gap in current agentic capabilities.

Existing AI benchmarks for software automation rarely combine cross-application coordination, autonomous API discovery, and policy adherence. Real business workflows demand all three: a single task may span a CRM, inbox, calendar, and messaging platform, requiring the agent to find the right endpoints, follow a policy document, and write correct data to each system. To address this gap, we introduce AutomationBench, a benchmark for evaluating AI agents on cross-application workflow orchestration via REST APIs. Drawing on real workflow patterns from Zapier's platform, tasks span Sales, Marketing, Operations, Support, Finance, and HR domains. Agents must discover relevant endpoints themselves, follow layered business rules, and navigate environments with irrelevant and sometimes misleading records. Grading is programmatic and end-state only: it checks whether the correct data ended up in the right systems. Even the best frontier models currently score below 10%. AutomationBench provides a challenging, realistic measure of where current models stand relative to the agentic capabilities businesses actually need.
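
To make "end-state only" grading concrete, here is a minimal sketch of what such a check could look like. It is not AutomationBench's actual grader: the function name `grade_end_state`, the nested-dict state layout, and the sample records are all assumptions made for illustration. The idea it shows is that only the final contents of each system are compared against expectations, while the sequence of API calls and any unrelated records are ignored.

```python
from typing import Any

# Hypothetical layout: each system's state maps record IDs to field dicts.
SystemState = dict[str, dict[str, Any]]


def grade_end_state(
    final: dict[str, SystemState],
    expected: dict[str, SystemState],
) -> bool:
    """Return True only if every expected record exists with the expected values.

    Only the final state is inspected; how the agent got there is ignored.
    Extra records and extra fields (e.g. distractors) do not affect the result.
    """
    for system, records in expected.items():
        actual_records = final.get(system, {})
        for record_id, fields in records.items():
            actual = actual_records.get(record_id)
            if actual is None:
                return False  # the record never reached this system
            for key, value in fields.items():
                if key not in actual or actual[key] != value:
                    return False  # field missing or written incorrectly
    return True


# Illustrative data only (assumed, not taken from the benchmark).
expected = {
    "crm": {"lead-1": {"status": "qualified"}},
    "calendar": {"evt-1": {"attendee": "ana@example.com"}},
}
final_ok = {
    "crm": {
        "lead-1": {"status": "qualified", "owner": "sam"},
        "lead-2": {"status": "new"},  # irrelevant record, ignored
    },
    "calendar": {"evt-1": {"attendee": "ana@example.com"}},
}
final_bad = {
    "crm": {"lead-1": {"status": "new"}},
    "calendar": {},
}

passed_ok = grade_end_state(final_ok, expected)
passed_bad = grade_end_state(final_bad, expected)
```

Under this all-or-nothing scheme, a task that touches several systems fails if any single write is missing or wrong. A real grader would likely also need to check that the agent did not make forbidden writes, which this sketch does not attempt.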
