Analyzing Large Language Models for Classroom Discussion Assessment
This work addresses the problem of automating classroom discussion assessment for educators. It is incremental in that it analyzes existing LLMs rather than introducing new methods.
The study investigates how task formulation, context length, and few-shot examples affect the performance of two large language models in assessing classroom discussion quality. It finds that all three factors influence performance and that predictive consistency is related to performance, and it recommends an approach that balances predictive performance, computational efficiency, and consistency.
Automatically assessing classroom discussion quality is becoming increasingly feasible thanks to recent NLP advances such as large language models (LLMs). In this work, we examine how the assessment performance of two LLMs interacts with three factors that may affect it: task formulation, context length, and few-shot examples. We also explore the computational efficiency and predictive consistency of the two LLMs. Our results suggest that all three factors affect the performance of the tested LLMs and that consistency is related to performance. We recommend an LLM-based assessment approach that strikes a good balance among predictive performance, computational efficiency, and consistency.