AI HC Dec 17, 2024

A Scalable Approach to Benchmarking the In-Conversation Differential Diagnostic Accuracy of a Health AI

Deep Bhatt, Surya Ayyagari, Anuruddh Mishra

arXiv:2412.12538v12.34 citations h-index: 1

Originality: Incremental advance

AI Analysis

This work targets diagnostic errors in healthcare by providing a reproducible framework for evaluating health AI systems. The advance is incremental, since it builds on existing AI chatbot approaches.

The study tackles the lack of standardized evaluation for health AI chatbots by introducing a scalable benchmarking methodology. When the methodology was applied to the chatbot August, August achieved a top-one diagnostic accuracy of 81.8% and required 47% fewer questions than traditional symptom checkers.

Diagnostic errors in healthcare persist as a critical challenge, with increasing numbers of patients turning to online resources for health information. While AI-powered healthcare chatbots show promise, there exists no standardized and scalable framework for evaluating their diagnostic capabilities. This study introduces a scalable benchmarking methodology for assessing health AI systems and demonstrates its application through August, an AI-driven conversational chatbot. Our methodology employs 400 validated clinical vignettes across 14 medical specialties, using AI-powered patient actors to simulate realistic clinical interactions. In systematic testing, August achieved a top-one diagnostic accuracy of 81.8% (327/400 cases) and a top-two accuracy of 85.0% (340/400 cases), significantly outperforming traditional symptom checkers. The system demonstrated 95.8% accuracy in specialist referrals and required 47% fewer questions compared to conventional symptom checkers (mean 16 vs 29 questions), while maintaining empathetic dialogue throughout consultations. These findings demonstrate the potential of AI chatbots to enhance healthcare delivery, though implementation challenges remain regarding real-world validation and integration of objective clinical data. This research provides a reproducible framework for evaluating healthcare AI systems, contributing to the responsible development and deployment of AI in clinical settings.
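The headline metrics in the abstract (top-one accuracy, top-two accuracy, and the relative reduction in questions asked) can be illustrated with a minimal sketch. The function names, the exact-string matching rule, and the toy cases below are assumptions made for illustration; the abstract does not say how a predicted diagnosis is matched against the vignette's reference diagnosis, and none of the sample data comes from the paper.

```python
from typing import Sequence


def top_k_accuracy(ranked: Sequence[Sequence[str]], truth: Sequence[str], k: int) -> float:
    """Fraction of cases whose reference diagnosis is among the first k ranked diagnoses.

    Assumption: a hit is an exact string match. A real benchmark would likely
    need a more careful matching rule (synonyms, clinician or model adjudication).
    """
    if len(ranked) != len(truth):
        raise ValueError("ranked and truth must have the same length")
    if not truth:
        raise ValueError("at least one case is required")
    hits = sum(1 for preds, gold in zip(ranked, truth) if gold in preds[:k])
    return hits / len(truth)


def relative_reduction(baseline_mean: float, system_mean: float) -> float:
    """Relative reduction of system_mean with respect to baseline_mean."""
    return (baseline_mean - system_mean) / baseline_mean


# Toy, made-up cases for illustration only (not data from the paper).
ranked = [
    ["migraine", "tension headache"],
    ["gastritis", "peptic ulcer"],
    ["asthma", "bronchitis"],
    ["sinusitis", "common cold"],
]
truth = ["migraine", "peptic ulcer", "pneumonia", "sinusitis"]

top1 = top_k_accuracy(ranked, truth, k=1)  # 2 of 4 cases hit at rank 1
top2 = top_k_accuracy(ranked, truth, k=2)  # 3 of 4 cases hit within rank 2
print(f"top-1: {top1:.2f}, top-2: {top2:.2f}")
```

With the counts reported in the abstract, 327/400 = 0.8175 and 340/400 = 0.85, which agree with the stated 81.8% and 85.0%. Applying `relative_reduction` to the rounded means (29 vs 16 questions) gives roughly 45%, so the reported 47% presumably reflects unrounded means.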
