HC AI CL MA | Jun 27, 2025

Toward the Autonomous AI Doctor: Quantitative Benchmarking of an Autonomous Agentic AI Versus Board-Certified Clinicians in a Real World Setting

Hashim Hayat, Maksim Kudrautsau, Evgeniy Makarov, Vlad Melnichenko, Tim Tsykunou, Piotr Varaksin, Matt Pavelle, Adam Z. Oskowitz

arXiv:2507.22902v1 | 2 citations | h-index: 11

Originality: Incremental advance

AI Analysis

This work addresses the global shortage of healthcare practitioners and the high administrative burden on clinicians, offering autonomous AI as a potential solution. It is an incremental advance because it builds on existing LLM technology.

The study tackled the problem of healthcare workforce shortages and administrative burden by evaluating an autonomous AI doctor in a virtual urgent care setting, finding that the AI matched human clinicians in top diagnosis in 81% of cases and treatment plan alignment in 99.2% of cases, with AI performance superior in 36.1% of discordant cases.

Background: Globally, we face a projected shortage of 11 million healthcare practitioners by 2030, and administrative burden consumes 50% of clinical time. Artificial intelligence (AI) has the potential to help alleviate these problems. However, no end-to-end autonomous large language model (LLM)-based AI system has been rigorously evaluated in real-world clinical practice. In this study, we evaluated whether a multi-agent LLM-based AI framework can function autonomously as an AI doctor in a virtual urgent care setting.

Methods: We retrospectively compared the performance of the multi-agent AI system Doctronic and board-certified clinicians across 500 consecutive urgent-care telehealth encounters. The primary end points (diagnostic concordance, treatment plan consistency, and safety metrics) were assessed by blinded LLM-based adjudication and expert human review.

Results: The top diagnoses of Doctronic and the clinician matched in 81% of cases, and the treatment plans aligned in 99.2% of cases. No clinical hallucinations (e.g., a diagnosis or treatment not supported by clinical findings) occurred. In an expert review of discordant cases, AI performance was superior in 36.1%, and human performance was superior in 9.3%; the diagnoses were equivalent in the remaining cases.

Conclusions: In this first large-scale validation of an autonomous AI doctor, we demonstrated strong diagnostic and treatment plan concordance with human clinicians, with AI performance matching and in some cases exceeding that of practicing clinicians. These findings indicate that multi-agent AI systems achieve clinical decision-making comparable to that of human providers and offer a potential solution to healthcare workforce shortages.
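The reported percentages use two different denominators: diagnostic concordance is a share of all encounters, while the "superior" figures are shares of the discordant cases only. The following is a minimal arithmetic sketch of how the two combine. It assumes that the expert review covered every discordant case and that the rounded figures from the abstract are exact; neither is confirmed by the abstract. The function name `breakdown` and its parameters are hypothetical, chosen only for illustration.

```python
def breakdown(concordance, ai_better, human_better):
    """Combine rates that are reported against different denominators.

    concordance  : share of ALL encounters with a matching top diagnosis
    ai_better    : share of DISCORDANT cases where the AI was judged superior
    human_better : share of DISCORDANT cases where the clinician was judged superior

    Assumption (not stated in the abstract): every discordant case was
    reviewed, so the discordant share is simply 1 - concordance.
    """
    discordant = 1.0 - concordance
    # Whatever is neither "AI superior" nor "human superior" was judged equivalent.
    equivalent = 1.0 - ai_better - human_better
    return {
        "discordant_share": discordant,
        "equivalent_within_discordant": equivalent,
        # Re-express the discordant-case rates as shares of all encounters.
        "ai_better_overall": discordant * ai_better,
        "human_better_overall": discordant * human_better,
    }


# Rounded figures taken from the abstract.
result = breakdown(concordance=0.81, ai_better=0.361, human_better=0.093)
for name, value in result.items():
    print(f"{name}: {value:.1%}")
```

Under these assumptions, the cases where one side was judged clearly superior make up only a small share of all encounters, which is worth keeping in mind when reading the 36.1% figure.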
