CLHCFeb 25, 2025

Exploring the Potential of Large Language Models for Estimating the Reading Comprehension Question Difficulty

arXiv:2502.17785v17 citationsh-index: 4HCI
Originality Synthesis-oriented
AI Analysis

This work addresses the scalability challenge in educational assessment for learners and systems, offering an incremental improvement by applying existing LLMs to automate difficulty estimation.

The study investigated using OpenAI's GPT-4o and o1 to estimate reading comprehension question difficulty on the SARA dataset, finding that the models' estimates aligned meaningfully with IRT parameters but showed differences in sensitivity to extreme item characteristics.

Reading comprehension is a key for individual success, yet the assessment of question difficulty remains challenging due to the extensive human annotation and large-scale testing required by traditional methods such as linguistic analysis and Item Response Theory (IRT). While these robust approaches provide valuable insights, their scalability is limited. There is potential for Large Language Models (LLMs) to automate question difficulty estimation; however, this area remains underexplored. Our study investigates the effectiveness of LLMs, specifically OpenAI's GPT-4o and o1, in estimating the difficulty of reading comprehension questions using the Study Aid and Reading Assessment (SARA) dataset. We evaluated both the accuracy of the models in answering comprehension questions and their ability to classify difficulty levels as defined by IRT. The results indicate that, while the models yield difficulty estimates that align meaningfully with derived IRT parameters, there are notable differences in their sensitivity to extreme item characteristics. These findings suggest that LLMs can serve as the scalable method for automated difficulty assessment, particularly in dynamic interactions between learners and Adaptive Instructional Systems (AIS), bridging the gap between traditional psychometric techniques and modern AIS for reading comprehension and paving the way for more adaptive and personalized educational assessments.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes