CY AI · Apr 3

The Impact of LLM Self-Consistency and Reasoning Effort on Automated Scoring Accuracy and Cost

arXiv:2604.26954 · 23.8

Predicted impact: top 77% in CY · last 90 days
Originality: Synthesis-oriented

AI Analysis

This study gives educators and researchers practical guidance on selecting LLM configurations to optimize scoring accuracy and cost, though the findings are incremental and domain-specific.

The study found that, for automated scoring of student conversations, temperature sampling improves accuracy over deterministic calls, but increasing the ensemble size (up to 7) yields no significant gains. Higher reasoning effort shows a positive linear trend with accuracy. The best cost-performance balance was achieved by low-cost models with no reasoning.
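One common way to reason about a cost-performance balance like the one described above is to keep only the configurations that no other configuration beats on both cost and accuracy. The sketch below is illustrative only and is not the paper's analysis code: the `Config` type, the `efficiency_frontier` function, and every number in the usage example are hypothetical placeholders, not results from the study.

```python
from typing import Iterable, List, NamedTuple


class Config(NamedTuple):
    """One model configuration (hypothetical fields for illustration)."""
    name: str
    cost: float      # e.g. cost per scored conversation, in any fixed unit
    accuracy: float  # agreement with human scores, higher is better


def efficiency_frontier(configs: Iterable[Config]) -> List[Config]:
    """Return the configurations not dominated on (lower cost, higher accuracy)."""
    frontier: List[Config] = []
    best_accuracy = float("-inf")
    # Walk from cheapest to most expensive; at equal cost, try the most accurate first.
    for cfg in sorted(configs, key=lambda c: (c.cost, -c.accuracy)):
        # A configuration stays only if it beats every cheaper (or equal-cost) one.
        if cfg.accuracy > best_accuracy:
            frontier.append(cfg)
            best_accuracy = cfg.accuracy
    return frontier


# Made-up numbers, NOT taken from the paper.
candidates = [
    Config("config_a", cost=1.0, accuracy=0.80),
    Config("config_b", cost=2.0, accuracy=0.78),
    Config("config_c", cost=5.0, accuracy=0.85),
    Config("config_d", cost=5.0, accuracy=0.82),
]
frontier_names = [c.name for c in efficiency_frontier(candidates)]
```

With these placeholder values, `config_b` and `config_d` drop out because a configuration that costs no more is more accurate. Which point on the remaining frontier counts as the "best balance" is a separate judgment about how much extra cost a gain in accuracy is worth.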

Strategic model selection and reasoning settings are more effective than ensembling for optimizing automated scoring with large language models (LLMs). We examined self-consistency (intra-model majority voting) and reasoning effort for scoring conversation-based assessment items in high school mathematics, evaluating 900 student conversations against human-scored ground truths using frontier and low-cost models from OpenAI and Google. Temperature sampling significantly improved accuracy over deterministic calls, but increasing ensemble size (j = 1 to 7) produced no significant gains. Higher reasoning effort showed a significant positive linear trend with scoring accuracy, though the benefit varied by model family. An efficiency frontier analysis identified Gemini 3.1 Pro Preview at low reasoning as the most accurate but costly configuration; GPT-5.4 Nano and Mini with no reasoning offered the best cost-performance balance.
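Self-consistency as described in the abstract (intra-model majority voting) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `sample_score` is a hypothetical stand-in for one temperature-sampled LLM scoring call, and the tie-breaking rule (first-sampled score wins) is an assumption, since the abstract does not say how ties are resolved.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def majority_vote(scores: Sequence[Hashable]) -> Hashable:
    """Return the most frequent score; ties go to the score that was sampled first."""
    if not scores:
        raise ValueError("need at least one sampled score")
    counts = Counter(scores)
    top = max(counts.values())
    # Scan in sampling order so that tie-breaking is deterministic.
    for score in scores:
        if counts[score] == top:
            return score
    raise AssertionError("unreachable: some score always has the top count")


def self_consistent_score(sample_score: Callable[[], Hashable], j: int) -> Hashable:
    """Draw j independent samples from the same model and majority-vote them.

    `sample_score` is a hypothetical callable wrapping a single scoring call;
    j = 1 reduces to a single sampled call.
    """
    if j < 1:
        raise ValueError("ensemble size j must be at least 1")
    return majority_vote([sample_score() for _ in range(j)])


# Usage with a stubbed sampler instead of a real model call.
fake_samples = iter([1, 0, 1, 1, 0, 1, 0])
voted = self_consistent_score(lambda: next(fake_samples), j=7)
```

Using an odd ensemble size avoids ties for binary scores. Cost grows linearly with `j`, because each sample is a separate model call.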
