HealthAdminBench: Evaluating Computer-Use Agents on Healthcare Administration Tasks
This benchmark addresses the lack of evaluation for LLM agents on end-to-end healthcare administrative workflows, a domain with over $1 trillion in annual spending. Its scope is domain-specific, however, and its methodology is incremental.
HealthAdminBench introduces a benchmark for evaluating LLM-based computer-use agents on healthcare administration tasks, comprising 135 tasks across four GUI environments. The best-performing agent achieves only 36.3% task success, revealing a substantial gap between current capabilities and real-world demands.
Healthcare administration accounts for over $1 trillion in annual spending, making it a promising target for LLM-based computer-use agents (CUAs). While clinical applications of LLMs have received significant attention, no benchmark exists for evaluating CUAs on end-to-end administrative workflows. To address this gap, we introduce HealthAdminBench, a benchmark comprising four realistic GUI environments (an EHR, two payer portals, and a fax system) and 135 expert-defined tasks spanning three administrative task types: Prior Authorization, Appeals and Denials Management, and Durable Medical Equipment (DME) Order Processing. Each task is decomposed into fine-grained, verifiable subtasks, yielding 1,698 evaluation points. We evaluate seven agent configurations under multiple prompting and observation settings and find that, despite strong subtask performance, end-to-end reliability remains low: the best-performing agent (Claude Opus 4.6 CUA) achieves only 36.3% task success, while GPT-5.4 CUA attains the highest subtask success rate (82.8%). These results reveal a substantial gap between current agent capabilities and the demands of real-world administrative workflows. HealthAdminBench provides a rigorous foundation for evaluating progress toward safe and reliable automation of healthcare administrative workflows.
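The gap between subtask success and task success is easier to see with a small worked example. With 1,698 evaluation points over 135 tasks, a task has roughly 12.6 subtasks on average, so a few failed checks spread across tasks can sink many of them. The sketch below is illustrative only. The abstract does not state the exact scoring rule, so the all-or-nothing rule (a task succeeds only if every one of its subtasks passes), the `TaskResult` schema, and the function names are assumptions, and the data is a toy set rather than benchmark output.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one task: one boolean per verifiable subtask (hypothetical schema)."""
    task_id: str
    subtask_passed: list[bool]


def subtask_success_rate(results: list[TaskResult]) -> float:
    """Fraction of all subtask checks that passed, pooled across tasks."""
    total = sum(len(r.subtask_passed) for r in results)
    passed = sum(sum(r.subtask_passed) for r in results)
    return passed / total if total else 0.0


def task_success_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks whose subtasks all passed (assumed all-or-nothing rule)."""
    if not results:
        return 0.0
    return sum(all(r.subtask_passed) for r in results) / len(results)


# Toy data, not benchmark results: three tasks with ten subtasks each.
toy = [
    TaskResult("task-a", [True] * 10),
    TaskResult("task-b", [True] * 9 + [False]),
    TaskResult("task-c", [True] * 8 + [False] * 2),
]

print(f"subtask success: {subtask_success_rate(toy):.1%}")  # 27 of 30 checks pass
print(f"task success:    {task_success_rate(toy):.1%}")     # 1 of 3 tasks passes
```

In this toy set, 90.0% of the subtask checks pass but only 33.3% of the tasks succeed, the same pattern the reported results show at larger scale.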