CY CL LGOct 14, 2025

Toward LLM-Supported Automated Assessment of Critical Thinking Subskills

Marisa C. Peczuh, Nischal Ashok Kumar, Ryan Baker, Blair Lehman, Danielle Eisenberg, Caitlin Mills, Keerthi Chebrolu, Sudhip Nashi, Cadence Young, Brayden Liu, Sherry Lachman, Andrew Lan

arXiv:2510.12915v11.22 citationsh-index: 1Has Code

Originality Incremental advance

AI Analysis

This work addresses the need for scalable assessment of higher-order reasoning skills in education, representing an incremental step in learning analytics.

The paper tackled the problem of measuring critical thinking subskills in student-written argumentative essays by evaluating automated scoring methods using large language models, with GPT-5 achieving the strongest results through few-shot prompting, particularly for subskills with separable and frequent categories.

Critical thinking represents a fundamental competency in today's education landscape. Developing critical thinking skills through timely assessment and feedback is crucial; however, there has not been extensive work in the learning analytics community on defining, measuring, and supporting critical thinking. In this paper, we investigate the feasibility of measuring core "subskills" that underlie critical thinking. We ground our work in an authentic task where students operationalize critical thinking: student-written argumentative essays. We developed a coding rubric based on an established skills progression and completed human coding for a corpus of student essays. We then evaluated three distinct approaches to automated scoring: zero-shot prompting, few-shot prompting, and supervised fine-tuning, implemented across three large language models (GPT-5, GPT-5-mini, and ModernBERT). GPT-5 with few-shot prompting achieved the strongest results and demonstrated particular strength on subskills with separable, frequent categories, while lower performance was observed for subskills that required detection of subtle distinctions or rare categories. Our results underscore critical trade-offs in automated critical thinking assessment: proprietary models offer superior reliability at higher cost, while open-source alternatives provide practical accuracy with reduced sensitivity to minority categories. Our work represents an initial step toward scalable assessment of higher-order reasoning skills across authentic educational contexts.

View on arXiv PDF

Similar