CLME
Sep 14, 2025

How Model Size, Temperature, and Prompt Style Affect LLM-Human Assessment Score Alignment

arXiv:2509.19329v1
2 citations
h-index: 7
Originality: Synthesis-oriented
AI Analysis

This work addresses the problem of ensuring reliable AI assessments in healthcare for researchers and practitioners, though it is incremental in nature.

The study investigated how model size, temperature, and prompt style influence the alignment of Large Language Models' assessments with human evaluations in clinical reasoning, finding that model size is a key factor in improving this alignment.

We examined how model size, temperature, and prompt style affect the alignment of Large Language Models (LLMs) with themselves, with other models, and with humans when assessing clinical reasoning skills. Model size emerged as a key factor in LLM-human score alignment. The study highlights the importance of checking alignment at multiple levels.

