CL · Jul 26, 2024

OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation

arXiv:2407.19056v1 · 32 citations · h-index: 8
Originality: Incremental advance
AI Analysis

The paper addresses the problem of automating complex office workflows to enhance productivity, though the advance is incremental because it builds on existing document AI research.

The authors introduce OfficeBench, a benchmark for evaluating language agents on realistic office automation tasks that require long-horizon planning and application switching. They find that GPT-4 Omni achieves the highest pass rate, 47.00%, but still falls short of human performance.

Office automation significantly enhances human productivity by automatically completing routine tasks in the workflow. Beyond the basic information extraction studied in much of the prior document AI literature, office automation research should be extended to more realistic office tasks that require integrating various information sources in the office system and producing outputs through a series of decision-making processes. We introduce OfficeBench, one of the first office automation benchmarks for evaluating current LLM agents' capability to address office tasks in realistic office workflows. OfficeBench requires LLM agents to perform feasible long-horizon planning, switch between applications proficiently and in a timely manner, and accurately ground their actions within a large combined action space, based on the contextual demands of the workflow. Applying our customized evaluation methods to each task, we find that GPT-4 Omni achieves the highest pass rate of 47.00%, demonstrating decent performance in handling office tasks. However, this is still far below human performance and the accuracy standards required by real-world office workflows. We further observe that most issues are related to operation redundancy and hallucinations, as well as limitations in switching between multiple applications, which may provide valuable insights for developing effective agent frameworks for office automation.

Code Implementations: 1 repo