CL, AI | Sep 29, 2025

Beyond Overall Accuracy: A Psychometric Deep Dive into the Topic-Specific Medical Capabilities of 80 Large Language Models

Zhimeng Luo, Lixin Wu, Adam Frisch, Daqing He

arXiv:2509.24186v1 | 2 citations | h-index: 5

Originality: Incremental advance

AI Analysis

This work addresses the need for reliable evaluation in high-stakes medical LLM deployment, offering a psychometric methodology to improve safety and trustworthiness, though it is incremental in applying existing IRT methods to a new domain.

The paper tackled the problem of evaluating large language models (LLMs) for medical applications by introducing MedIRT, an Item Response Theory-based framework, and found that overall accuracy rankings can be misleading, with GPT-5 leading in 8 of 11 domains but being outperformed by Claude-3-opus in Social Science and Communication.

As Large Language Models (LLMs) are increasingly proposed for high-stakes medical applications, there has emerged a critical need for reliable and accurate evaluation methodologies. Traditional accuracy metrics fall short, as they neither capture question characteristics nor offer topic-specific insights. To address this gap, we introduce MedIRT, a rigorous evaluation framework grounded in Item Response Theory (IRT), the gold standard in high-stakes educational testing. Unlike previous research relying on archival data, we prospectively gathered fresh responses from 80 diverse LLMs on a balanced, 1,100-question USMLE-aligned benchmark. Using one unidimensional two-parameter logistic IRT model per topic, we estimate each LLM's latent ability jointly with question difficulty and discrimination, yielding more stable and nuanced performance rankings than accuracy alone. Notably, we identify distinctive "spiky" ability profiles, where overall rankings can be misleading due to highly specialized model abilities. While GPT-5 was the top performer in a majority of domains (8 of 11), it was outperformed in Social Science and Communication by Claude-3-opus, demonstrating that even an overall 23rd-ranked model can hold the top spot for specific competencies. Furthermore, we demonstrate IRT's utility in auditing benchmarks by identifying flawed questions. We synthesize these findings into a practical decision-support framework that integrates our multi-factor competency profiles with operational metrics. This work establishes a robust, psychometrically grounded methodology essential for the safe, effective, and trustworthy deployment of LLMs in healthcare.
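For readers unfamiliar with the two-parameter logistic (2PL) model named in the abstract, the sketch below shows its standard textbook response function: the probability of a correct answer as a function of latent ability, item discrimination, and item difficulty. It is an illustration only. The abstract does not give the paper's exact parameterization or fitting procedure, and the function name `p_correct` and all numeric values are hypothetical.

```python
import math


def p_correct(theta: float, a: float, b: float) -> float:
    """Standard 2PL item response function.

    theta: latent ability of the respondent (here, an LLM)
    a:     item discrimination (how sharply the item separates abilities)
    b:     item difficulty (ability at which P(correct) = 0.5)
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical abilities and items, chosen only for illustration.
abilities = {"model_A": 1.0, "model_B": -0.5}
items = [(1.5, 0.0), (0.8, 1.2)]  # (discrimination, difficulty) pairs

for name, theta in abilities.items():
    probs = [p_correct(theta, a, b) for a, b in items]
    print(name, [round(p, 3) for p in probs])
```

In a full analysis, the abilities and item parameters would be estimated jointly from the observed response matrix, separately for each topic as the abstract describes, rather than set by hand as they are here.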
