HCJan 21
LLM-based Multimodal Feedback Produces Equivalent Learning and Better Student Perceptions than Educator FeedbackChloe Qianhui Zhao, Jie Cao, Jionghao Lin et al.
Providing timely, targeted, and multimodal feedback helps students quickly correct errors, build deep understanding and stay motivated, yet making it at scale remains a challenge. This study introduces a real-time AI-facilitated multimodal feedback system that integrates structured textual explanations with dynamic multimedia resources, including the retrieved most relevant slide page references and streaming AI audio narration. In an online crowdsourcing experiment, we compared this system against fixed business-as-usual feedback by educators across three dimensions: (1) learning effectiveness, (2) learner engagement, (3) perceived feedback quality and value. Results showed that AI multimodal feedback achieved learning gains equivalent to original educator feedback while significantly outperforming it on perceived clarity, specificity, conciseness, motivation, satisfaction, and reducing cognitive load, with comparable correctness, trust, and acceptance. Process logs revealed distinct engagement patterns: for multiple-choice questions, educator feedback encouraged more submissions; for open-ended questions, AI-facilitated targeted suggestions lowered revision barriers and promoted iterative improvement. These findings highlight the potential of AI multimodal feedback to provide scalable, real-time, and context-aware support that both reduces instructor workload and enhances student experience.
HCApr 8
To Layer or Not to Layer? Evaluating the Effects and Mechanisms of LLM-Generated Feedback on learning performanceJie Cao, Chloe Qianhui Zhao, Christian Schunn et al.
Feedback is vital for learning, yet its effectiveness depends not only on its content but also on how it engages students in the learning process. Large Language Models (LLMs) offer novel opportunities to efficiently generate rich, formative feedback, ranging from direct explanations to incrementally layered scaffolding designed to foster learner autonomy. Despite these affordances, it remains unclear whether layered feedback (which sequences encouragement and prompts prior to revealing the correct answer) actually improves engagement and learning outcomes. To address this, we randomly assigned 199 participants to receive either layered or non-layered LLM-generated feedback. We assessed its impact on learning performance, behavioral and cognitive engagement, and affective perceptions, to determine how these factors mediate learning performance. Results indicate that layered feedback elicited slightly higher behavioral engagement and, as anticipated, was perceived as more encouraging and supportive of independence. However, it concurrently induced greater mental effort. Mediation analyses revealed a positive affective pathway driven by perceived encouragement, which was counteracted by a negative behavioral pathway linked to the average number of tasks requiring $\geq 3$ submissions; the cognitive pathway (mental effort) was non-significant. Taken together, layered feedback resulted in significantly poorer learning outcomes compared to non-layered feedback. These findings illuminate a critical trade-off: while layered scaffolding enhances engagement and positive perceptions, it can detrimentally impact actual learning performance. This study contributes nuanced insights for the design of automated, LLM-driven feedback systems by integrating outcome, perception, and mechanism-level analyses.
HCApr 3, 2025
Toward Automated Qualitative Analysis: Leveraging Large Language Models for Tutoring Dialogue EvaluationMegan Gu, Chloe Qianhui Zhao, Claire Liu et al. · cmu
Our study introduces an automated system leveraging large language models (LLMs) to assess the effectiveness of five key tutoring strategies: 1. giving effective praise, 2. reacting to errors, 3. determining what students know, 4. helping students manage inequity, and 5. responding to negative self-talk. Using a public dataset from the Teacher-Student Chatroom Corpus, our system classifies each tutoring strategy as either being employed as desired or undesired. Our study utilizes GPT-3.5 with few-shot prompting to assess the use of these strategies and analyze tutoring dialogues. The results show that for the five tutoring strategies, True Negative Rates (TNR) range from 0.655 to 0.738, and Recall ranges from 0.327 to 0.432, indicating that the model is effective at excluding incorrect classifications but struggles to consistently identify the correct strategy. The strategy \textit{helping students manage inequity} showed the highest performance with a TNR of 0.738 and Recall of 0.432. The study highlights the potential of LLMs in tutoring strategy analysis and outlines directions for future improvements, including incorporating more advanced models for more nuanced feedback.