HCAI · Mar 12

Evaluation format, not model capability, drives triage failure in the assessment of consumer health AI

arXiv:2603.11413v1 · 7.04 citations · h-index: 4
Predicted impact top 40% in HC · last 90 days · Originality Synthesis-oriented
AI Analysis

This work addresses misleading safety assessments of consumer health AI, a concern for researchers and policymakers. It shows that evaluation methods must reflect real-world use to avoid overestimating risk.

The study found that the reported 51.6% under-triage rate for consumer health AI depended heavily on the exam-style evaluation format rather than on model capability. Naturalistic interactions improved triage accuracy by 6.4 percentage points, and diabetic ketoacidosis was correctly triaged in 100% of trials.

Ramaswamy et al. reported in Nature Medicine that ChatGPT Health under-triages 51.6% of emergencies, concluding that consumer-facing AI triage poses safety risks. However, their evaluation used an exam-style protocol (forced A/B/C/D output, knowledge suppression, and suppression of clarifying questions) that differs fundamentally from how consumers use health chatbots. We tested five frontier LLMs (GPT-5.2, Claude Sonnet 4.6, Claude Opus 4.6, Gemini 3 Flash, Gemini 3.1 Pro) on a 17-scenario partial replication bank under constrained (exam-style, 1,275 trials) and naturalistic (patient-style messages, 850 trials) conditions, with targeted ablations and prompt-faithful checks using the authors' released prompts. Naturalistic interaction improved triage accuracy by 6.4 percentage points (p = 0.015). Diabetic ketoacidosis was correctly triaged in 100% of trials across all models and conditions. Asthma triage improved from 48% to 80%. The forced A/B/C/D format was the dominant failure mechanism: three models scored 0-24% with forced choice but 100% with free text (all p < 10^-8), consistently recommending emergency care in their own words while the forced-choice format registered under-triage. Prompt-faithful checks on the authors' exact released prompts confirmed the scaffold produces model-dependent, case-dependent results. The headline under-triage rate is highly contingent on evaluation format and should not be interpreted as a stable estimate of deployed triage behavior. Valid evaluation of consumer health AI requires testing under conditions that reflect actual use.
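
The abstract's trial counts and format comparison can be illustrated with a small sketch. It is not the authors' analysis code. The abstract does not state the number of repetitions per cell, so the first helper assumes a fully balanced design (every model sees every scenario equally often). The 2x2 counts passed to the Fisher exact test (6 of 25 correct under forced choice, 25 of 25 under free text) are hypothetical values chosen only to resemble a 24% versus 100% contrast. The abstract also does not say which statistical test produced its p-values.

```python
from math import comb


def replicates_per_cell(total_trials: int, n_models: int, n_scenarios: int) -> int:
    """Trials per (model, scenario) cell, ASSUMING a fully balanced design."""
    cells = n_models * n_scenarios
    if total_trials % cells != 0:
        raise ValueError("trial count is not a multiple of models x scenarios")
    return total_trials // cells


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are conditions (e.g. forced choice, free text); columns are
    correct / incorrect triage counts.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x correct answers in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum every table that is as probable as, or less probable than, the observed one.
    # The small tolerance guards against floating-point ties.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


# Balanced-design check against the trial counts quoted in the abstract.
print(replicates_per_cell(1275, 5, 17))  # 15
print(replicates_per_cell(850, 5, 17))   # 10

# Hypothetical counts: 6/25 correct with forced choice, 25/25 with free text.
p_value = fisher_exact_two_sided(6, 19, 25, 0)
print(f"p = {p_value:.3g}")
```

Under the balanced-design assumption, the quoted totals correspond to 15 constrained and 10 naturalistic trials per model-scenario pair. The hypothetical table shows how a gap of this size yields a very small p-value even with a few dozen trials per arm.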

