AI CL Oct 22, 2025

A Multi-faceted Analysis of Cognitive Abilities: Evaluating Prompt Methods with Large Language Models on the CONSORT Checklist

arXiv:2510.19139v23.3

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for reliable and explainable AI in healthcare automation, though its contribution is incremental: it analyzes existing methods on a specific medical evaluation task.

The study evaluated how well large language models assess clinical trial reporting against CONSORT standards, finding that both general and specialized models showed significant miscalibration and overconfidence, with calibration errors exceeding clinically relevant thresholds.

Despite the rapid expansion of Large Language Models (LLMs) in healthcare, robust and explainable evaluation of their ability to assess clinical trial reporting according to CONSORT standards remains an open challenge. In particular, the uncertainty calibration and metacognitive reliability of LLM reasoning are poorly understood in medical automation. This study applies a behavioral and metacognitive analytic approach using an expert-validated dataset, systematically comparing two representative LLMs, one general and one domain-specialized, across three prompt strategies. We analyze both cognitive adaptation and calibration error using two metrics: Expected Calibration Error (ECE) and a baseline-normalized Relative Calibration Error (RCE) that enables reliable cross-model comparison. Our results reveal pronounced miscalibration and overconfidence in both models, especially under clinical role-playing conditions, with calibration error persisting above clinically relevant thresholds. These findings underscore the need for improved calibration, transparent code, and strategic prompt engineering to develop reliable and explainable medical AI.
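For readers unfamiliar with ECE, the following is a minimal sketch of the standard equal-width binned formulation, not the authors' code. The function name, the choice of 10 bins, and the input format (one stated confidence in [0, 1] and one correctness flag per checklist judgment) are assumptions made for illustration. RCE is not sketched because the abstract does not state how the baseline normalization is defined.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sample-weighted mean of |accuracy - mean confidence| per bin.

    confidences: floats in [0, 1], the model's stated confidence per judgment
    correct:     booleans, whether each judgment matched the expert label
    n_bins:      number of equal-width confidence bins (assumed default: 10)
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    # Group predictions into equal-width bins over [0, 1].
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for items in bins:
        if not items:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        # Weight each bin's confidence/accuracy gap by its share of samples.
        ece += (len(items) / n) * abs(accuracy - mean_conf)
    return ece
```

As an illustration with made-up numbers, a model that reports 0.9 confidence on four judgments but gets only two right has an ECE of about 0.4, which is the kind of overconfidence gap the abstract describes. A model whose confidence always equals its accuracy within each bin scores 0.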
