HC AI · Mar 12

Evaluation format, not model capability, drives triage failure in the assessment of consumer health AI

David Fraile Navarro, Farah Magrabi, Enrico Coiera

arXiv:2603.11413v17.64 citations · h-index: 75

Predicted impact: top 40% in HC · last 90 days

Originality: Synthesis-oriented

AI Analysis

This work addresses misleading safety assessments of consumer health AI. For researchers and policymakers, it shows that evaluation methods must reflect real-world use, or they risk overstating how unsafe these systems are.

The study found that the previously reported 51.6% under-triage rate for consumer health AI was driven largely by an exam-style evaluation format rather than by model capability. Naturalistic interaction improved triage accuracy by 6.4 percentage points, and diabetic ketoacidosis was correctly triaged in 100% of trials across all models and conditions.

Ramaswamy et al. reported in \textit{Nature Medicine} that ChatGPT Health under-triages 51.6\% of emergencies, concluding that consumer-facing AI triage poses safety risks. However, their evaluation used an exam-style protocol -- forced A/B/C/D output, knowledge suppression, and suppression of clarifying questions -- that differs fundamentally from how consumers use health chatbots. We tested five frontier LLMs (GPT-5.2, Claude Sonnet 4.6, Claude Opus 4.6, Gemini 3 Flash, Gemini 3.1 Pro) on a 17-scenario partial replication bank under constrained (exam-style, 1,275 trials) and naturalistic (patient-style messages, 850 trials) conditions, with targeted ablations and prompt-faithful checks using the authors' released prompts. Naturalistic interaction improved triage accuracy by 6.4 percentage points ($p = 0.015$). Diabetic ketoacidosis was correctly triaged in 100\% of trials across all models and conditions. Asthma triage improved from 48\% to 80\%. The forced A/B/C/D format was the dominant failure mechanism: three models scored 0--24\% with forced choice but 100\% with free text (all $p < 10^{-8}$), consistently recommending emergency care in their own words while the forced-choice format registered under-triage. Prompt-faithful checks on the authors' exact released prompts confirmed the scaffold produces model-dependent, case-dependent results. The headline under-triage rate is highly contingent on evaluation format and should not be interpreted as a stable estimate of deployed triage behavior. Valid evaluation of consumer health AI requires testing under conditions that reflect actual use.
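The trial counts and the forced-choice versus free-text contrast in the abstract can be sanity-checked with a little arithmetic. The sketch below is illustrative only and rests on two assumptions that the abstract does not confirm: that trials were spread evenly across every model-scenario pair, and that a two-sided Fisher exact test is a reasonable stand-in for whatever test produced the reported p-values. The function names and the 25-trial counts in the usage lines are hypothetical.

```python
from math import comb

MODELS = 5       # GPT-5.2, Claude Sonnet 4.6, Claude Opus 4.6, Gemini 3 Flash, Gemini 3.1 Pro
SCENARIOS = 17   # size of the partial replication bank


def trials_per_cell(total_trials: int, models: int = MODELS,
                    scenarios: int = SCENARIOS) -> float:
    """Trials per model-scenario pair, ASSUMING a fully balanced design."""
    return total_trials / (models * scenarios)


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are conditions (e.g. forced choice, free text); columns are
    outcomes (correct, incorrect). Sums the probabilities of all tables
    with the same margins that are no more likely than the observed one.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x "correct" outcomes in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


if __name__ == "__main__":
    print(trials_per_cell(1275))  # constrained condition
    print(trials_per_cell(850))   # naturalistic condition
    # HYPOTHETICAL counts: 0 of 25 correct with forced choice,
    # 25 of 25 correct with free text.
    print(fisher_exact_two_sided(0, 25, 25, 0))
```

Under the balanced-design assumption, 1,275 constrained trials work out to 15 per model-scenario pair and 850 naturalistic trials to 10. The Fisher example shows only that a contrast as extreme as 0% versus 100% yields a very small p-value even at modest sample sizes. It does not reproduce the paper's analysis.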
