CL ME Sep 14, 2025

How Model Size, Temperature, and Prompt Style Affect LLM-Human Assessment Score Alignment

Julie Jung, Max Lu, Sina Chole Benker, Dogus Darici

arXiv:2509.19329v1, 2 citations, h-index: 7

Originality: Synthesis-oriented

AI Analysis

This work addresses the problem of ensuring reliable AI assessments in healthcare for researchers and practitioners, though it is incremental in nature.

The study investigated how model size, temperature, and prompt style influence the alignment of Large Language Models' assessments with human evaluations in clinical reasoning, finding that model size is a key factor in improving this alignment.

We examined how model size, temperature, and prompt style affect the alignment of Large Language Models (LLMs) within a single model, between models, and with humans when assessing clinical reasoning skills. Model size emerged as a key factor in LLM-human score alignment. The study highlights the importance of checking alignment across multiple levels.
