May 1, 2024

WorkBench: a Benchmark Dataset for Agents in a Realistic Workplace Setting

arXiv:2405.00823v2 | 29 citations | h-index: 28 | Has Code
Originality: Incremental advance
AI Analysis

This paper addresses the problem of evaluating AI agents in high-stakes workplace settings, which matters to researchers and developers. It is an incremental advance because it builds on existing benchmark and agent frameworks.

The authors introduce WorkBench, a benchmark dataset for evaluating agents' ability to execute tasks in a realistic workplace setting. They find that the best-performing agent (GPT-4) completed only 43% of tasks and the weakest (Llama2-70B) only 3%, revealing significant weaknesses in handling common business activities.

We introduce WorkBench: a benchmark dataset for evaluating agents' ability to execute tasks in a workplace setting. WorkBench contains a sandbox environment with five databases, 26 tools, and 690 tasks. These tasks represent common business activities, such as sending emails and scheduling meetings. The tasks in WorkBench are challenging as they require planning, tool selection, and often multiple actions. If a task has been successfully executed, one (or more) of the database values may change. The correct outcome for each task is unique and unambiguous, which allows for robust, automated evaluation. We call this key contribution outcome-centric evaluation. We evaluate five existing ReAct agents on WorkBench, finding they successfully complete as few as 3% of tasks (Llama2-70B), and just 43% for the best-performing (GPT-4). We further find that agents' errors can result in the wrong action being taken, such as an email being sent to the wrong person. WorkBench reveals weaknesses in agents' ability to undertake common business activities, raising questions about their use in high-stakes workplace settings. WorkBench is publicly available as a free resource at https://github.com/olly-styles/WorkBench.
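The idea of outcome-centric evaluation described above can be illustrated with a minimal sketch: run the agent's actions and the reference actions against copies of the same initial sandbox state, then compare the resulting database contents. Everything below (the `INITIAL_STATE` layout, the `send_email` and `reschedule_event` tools, and the `outcome_matches` helper) is a hypothetical toy written for illustration. It is not the WorkBench repository's API and does not reproduce its five databases or 26 tools.

```python
import copy

# Hypothetical toy sandbox: two small "databases" (assumed layout, not WorkBench's).
INITIAL_STATE = {
    "email": [],
    "calendar": [{"id": 1, "title": "Standup", "time": "09:00"}],
}


def send_email(state, recipient, subject):
    """Toy tool: append a sent email to the email database."""
    state["email"].append({"to": recipient, "subject": subject})


def reschedule_event(state, event_id, new_time):
    """Toy tool: change the time of an existing calendar event."""
    for event in state["calendar"]:
        if event["id"] == event_id:
            event["time"] = new_time


TOOLS = {"send_email": send_email, "reschedule_event": reschedule_event}


def run_actions(actions):
    """Apply a list of (tool_name, kwargs) pairs to a fresh copy of the sandbox."""
    state = copy.deepcopy(INITIAL_STATE)
    for name, kwargs in actions:
        TOOLS[name](state, **kwargs)
    return state


def outcome_matches(agent_actions, reference_actions):
    """Compare final database states, ignoring the path taken to reach them."""
    return run_actions(agent_actions) == run_actions(reference_actions)


reference = [("send_email", {"recipient": "sam@example.com", "subject": "Q3 report"})]

# Same outcome reached with a redundant extra step: still counts as correct.
redundant = [
    ("reschedule_event", {"event_id": 1, "new_time": "09:00"}),
    ("send_email", {"recipient": "sam@example.com", "subject": "Q3 report"}),
]

# Email sent to the wrong person: the final state differs, so the task fails.
wrong_recipient = [("send_email", {"recipient": "alex@example.com", "subject": "Q3 report"})]

print(outcome_matches(redundant, reference))
print(outcome_matches(wrong_recipient, reference))
```

Because only the final state is compared, an agent that takes a different route to the right result passes, while one that sends an email to the wrong person fails. This works as an automated check only when each task has a single unambiguous correct outcome, as the abstract states.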

Code Implementations: 1 repo
