HC · Apr 6

Decoding Student Dialogue: A Multi-Dimensional Comparison and Bias Analysis of Large Language Models as Annotation Tools

arXiv:2604.04370 · 71.2
Predicted impact: top 8% in HC · last 90 days
Originality: Incremental advance
AI Analysis

This work addresses the time-consuming manual annotation of dialogue in educational research by analyzing the efficacy and biases of LLMs as annotation tools. It is incremental because it focuses on comparing existing models and prompting strategies.

This study evaluated GPT-5.2 and Gemini-3 as annotation tools for educational dialogue using three prompting strategies. Multi-agent prompting achieved the highest accuracy, although its advantage was not statistically significant. Performance varied by context: K-12 datasets showed significantly higher accuracy than university-level data, and accuracy was highest in the affective dimension and lowest in the cognitive dimension.

Educational dialogue is critical for decoding student learning processes, yet manual annotation remains time-consuming. This study evaluates the efficacy of GPT-5.2 and Gemini-3 using three prompting strategies (few-shot, single-agent, and multi-agent reflection) across diverse subjects, educational levels, and four coding dimensions. Results indicate that although multi-agent prompting achieved the highest accuracy, its advantage did not reach statistical significance. Accuracy proved highly context-dependent, with significantly higher performance in K-12 datasets than in university-level data, alongside disciplinary variations within the same educational level. Performance peaked in the affective dimension and was lowest in the cognitive dimension. Furthermore, analysis revealed four bias patterns: (1) Gemini-3 exhibited a consistent optimistic bias in the affective dimension across all subjects; (2) the cognitive dimension displayed domain-specific directional bias, characterized by systematic underestimation in Mathematics versus overestimation in Psychology; (3) both models were more prone to overestimation than underestimation within the meta-cognitive dimension; and (4) behavioral categories such as question, negotiation, and statements were frequently misclassified. These results underscore the need for context-sensitive deployment and targeted mitigation of directional biases in automated annotation.
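
To make the notion of directional bias concrete, the sketch below tallies accuracy, overestimation, and underestimation per coding dimension. It is illustrative only and is not the paper's method: it assumes each annotation is an integer ordinal level (higher meaning a more positive or more advanced code), and the function name `directional_bias`, the record layout, and the toy data are all hypothetical.

```python
from collections import defaultdict


def directional_bias(records):
    """Summarize agreement and bias direction per coding dimension.

    records: iterable of (dimension, human_level, model_level) tuples,
    where levels are integers on an ASSUMED ordinal scale.
    """
    stats = defaultdict(lambda: {"n": 0, "correct": 0, "over": 0, "under": 0})
    for dimension, human, model in records:
        s = stats[dimension]
        s["n"] += 1
        if model == human:
            s["correct"] += 1   # exact agreement with the human code
        elif model > human:
            s["over"] += 1      # model rated higher than the human coder
        else:
            s["under"] += 1     # model rated lower than the human coder
    return {
        dim: {
            "accuracy": s["correct"] / s["n"],
            "over_rate": s["over"] / s["n"],
            "under_rate": s["under"] / s["n"],
        }
        for dim, s in stats.items()
    }


# Toy data (fabricated for illustration, not taken from the study).
toy = [
    ("affective", 1, 2),
    ("affective", 2, 2),
    ("affective", 1, 1),
    ("cognitive", 3, 2),
    ("cognitive", 2, 2),
    ("cognitive", 3, 1),
]
summary = directional_bias(toy)
for dim, row in summary.items():
    print(dim, row)
```

A dimension whose `over_rate` clearly exceeds its `under_rate` would correspond to the kind of optimistic or overestimating pattern described above; for nominal categories such as the behavioral codes, a confusion matrix would be the more appropriate summary.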

