CL · AI · Jun 2

AI Rater Discrimination Depends on Scoring Protocol in Complex Clinical Decision-Making

arXiv:2606.03198 · 94.3 · Has Code

Predicted impact: top 15% in CL · last 90 days · Originality: Incremental advance

AI Analysis

For researchers evaluating clinical AI systems, this work demonstrates that rubric-anchored scoring is necessary to preserve discriminative power when using LLMs as raters, while rubric-free scoring is insufficient for complex decision-making tasks.

The study shows that AI raters (LLMs) scoring clinical decisions in type 2 diabetes pharmacotherapy produce narrow, inflated scores without a rubric, but a rubric-anchored protocol amplifies discrimination between CDSS outputs by factors of 1.76–5.10 and reveals rater model variation that rubric-free scoring suppresses.

Clinical AI evaluation increasingly delegates scoring to large language models (LLMs) acting as AI raters, yet their scoring behavior across evaluation conditions has not been quantitatively characterized. We address this gap through a factorial study of AI rater behavior in adult type 2 diabetes (T2D) pharmacotherapy at 12-month outpatient follow-up, a clinical task involving complex decision-making operationalized across seven evaluation questions. Four open-source LLMs served simultaneously as clinical decision support system (CDSS) models and AI raters. Each CDSS output was scored under two scoring protocols: a rubric-anchored Gold Rubric (GR) protocol incorporating a patient-specific rubric, and a rubric-free Non Gold Rubric (Non-GR) protocol. Linear mixed effects models crossed the scoring protocol factor with five design factors (CDSS model, CDSS prompt configuration [document-referenced generation (DRG) vs. Baseline], rater model, prompt character, and prompt type) and estimated main effects together with their protocol interactions. Across all questions, AI raters gave consistently higher scores within a very narrow range (74 to 78 points on average) under Non-GR than under GR, where mean scores were 7.69 to 49.64 points lower and interquartile ranges were 1.68 to 3.67 times wider. Within each question, GR amplified the AI rater's discrimination between DRG and Baseline CDSS outputs by factors of 1.76 to 5.10, while also revealing substantial behavioral variation across rater models that Non-GR suppressed. These findings support rubric anchoring as the scoring protocol that preserves discriminative power in clinical AI evaluation; rubric-free scoring cannot substitute when questions require patient-specific or jurisdiction-specific criteria that rater models cannot infer from parametric knowledge alone.
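The abstract does not say how the amplification factor is computed. One plausible reading is the ratio of the DRG-minus-Baseline mean score gap under GR to the same gap under Non-GR. The Python sketch below illustrates that reading only; the function names and all scores are hypothetical, and the paper's own estimates presumably come from the mixed effects models rather than from raw means.

```python
from statistics import mean


def discrimination_gap(scores):
    """Mean DRG score minus mean Baseline score under one scoring protocol."""
    return mean(scores["DRG"]) - mean(scores["Baseline"])


def amplification_factor(gr_scores, non_gr_scores):
    """Ratio of the GR gap to the Non-GR gap (an assumed definition)."""
    gap_non_gr = discrimination_gap(non_gr_scores)
    if gap_non_gr == 0:
        raise ValueError("Non-GR gap is zero; the ratio is undefined")
    return discrimination_gap(gr_scores) / gap_non_gr


# Hypothetical rater scores for one evaluation question (not data from the paper).
gr = {"DRG": [70, 62, 66], "Baseline": [40, 46, 40]}
non_gr = {"DRG": [80, 78, 79], "Baseline": [74, 72, 73]}

print(amplification_factor(gr, non_gr))  # GR gap 24, Non-GR gap 6
```

With these made-up numbers the Non-GR scores sit in a narrow high band, so the gap between DRG and Baseline outputs is small; the GR scores spread out and the same contrast becomes several times larger.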
