AI · Apr 10

HealthAdminBench: Evaluating Computer-Use Agents on Healthcare Administration Tasks

arXiv:2604.09937 · 75.53 citations · h-index: 22
AI Analysis

This benchmark addresses the lack of evaluation for LLM agents on end-to-end healthcare administrative workflows, a $1 trillion domain, but is domain-specific and incremental in methodology.

HealthAdminBench introduces a benchmark for evaluating LLM-based computer-use agents on healthcare administration tasks, comprising 135 tasks across four GUI environments. The best-performing agent achieves only 36.3% task success, revealing a substantial gap between current capabilities and real-world demands.

Healthcare administration accounts for over $1 trillion in annual spending, making it a promising target for LLM-based computer-use agents (CUAs). While clinical applications of LLMs have received significant attention, no benchmark exists for evaluating CUAs on end-to-end administrative workflows. To address this gap, we introduce HealthAdminBench, a benchmark comprising four realistic GUI environments (an EHR, two payer portals, and a fax system) and 135 expert-defined tasks spanning three administrative task types: Prior Authorization, Appeals and Denials Management, and Durable Medical Equipment (DME) Order Processing. Each task is decomposed into fine-grained, verifiable subtasks, yielding 1,698 evaluation points. We evaluate seven agent configurations under multiple prompting and observation settings and find that, despite strong subtask performance, end-to-end reliability remains low: the best-performing agent (Claude Opus 4.6 CUA) achieves only 36.3 percent task success, while GPT-5.4 CUA attains the highest subtask success rate (82.8 percent). These results reveal a substantial gap between current agent capabilities and the demands of real-world administrative workflows. HealthAdminBench provides a rigorous foundation for evaluating progress toward safe and reliable automation of healthcare administrative workflows.
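The gap between subtask success and task success can be illustrated with a small sketch. It assumes that a task counts as successful only when every one of its subtasks passes; the abstract does not spell out the scoring rule, so this is an assumption, and the names TaskResult, subtask_success_rate, and task_success_rate, as well as the toy data, are hypothetical. With 1,698 evaluation points over 135 tasks, a task carries roughly 12.6 checks on average, so a single miss is enough to fail an otherwise correct run.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical record: one task and the pass/fail outcome of each subtask."""
    task_id: str
    subtask_passed: list[bool]


def subtask_success_rate(results: list[TaskResult]) -> float:
    """Fraction of all subtask checks that passed, pooled across tasks."""
    total = sum(len(r.subtask_passed) for r in results)
    passed = sum(sum(r.subtask_passed) for r in results)
    return passed / total if total else 0.0


def task_success_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks whose subtasks ALL passed (assumed scoring rule)."""
    if not results:
        return 0.0
    solved = sum(
        1 for r in results if r.subtask_passed and all(r.subtask_passed)
    )
    return solved / len(results)


# Toy data, not taken from the benchmark: three tasks with ten subtasks each.
results = [
    TaskResult("prior-auth-001", [True] * 10),
    TaskResult("appeal-001", [True] * 9 + [False]),
    TaskResult("dme-001", [True] * 8 + [False] * 2),
]

print(f"subtask success: {subtask_success_rate(results):.1%}")
print(f"task success:    {task_success_rate(results):.1%}")
```

In this toy data, 27 of 30 subtasks pass while only one of three tasks is completed end to end, which mirrors the pattern the abstract reports: high subtask scores can coexist with low task success.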

