CVJul 1, 2025

Context-Aware Academic Emotion Dataset and Benchmark

Luming Zhao, Jingwen Xuan, Jiamin Lou, Yonghui Yu, Wenwu Yang

arXiv:2507.00586v13.6h-index: 2

Originality Incremental advance

AI Analysis

It addresses the problem of academic emotion recognition for educational applications, but is incremental as it builds on existing vision-language models and dataset creation.

This paper tackles the challenge of recognizing academic emotions from facial expressions in real-world learning environments by introducing RAER, a novel dataset of 2,700 video clips from diverse natural contexts, and CLIP-CAER, a method that integrates context cues using CLIP, which substantially outperforms state-of-the-art methods.

Academic emotion analysis plays a crucial role in evaluating students' engagement and cognitive states during the learning process. This paper addresses the challenge of automatically recognizing academic emotions through facial expressions in real-world learning environments. While significant progress has been made in facial expression recognition for basic emotions, academic emotion recognition remains underexplored, largely due to the scarcity of publicly available datasets. To bridge this gap, we introduce RAER, a novel dataset comprising approximately 2,700 video clips collected from around 140 students in diverse, natural learning contexts such as classrooms, libraries, laboratories, and dormitories, covering both classroom sessions and individual study. Each clip was annotated independently by approximately ten annotators using two distinct sets of academic emotion labels with varying granularity, enhancing annotation consistency and reliability. To our knowledge, RAER is the first dataset capturing diverse natural learning scenarios. Observing that annotators naturally consider context cues-such as whether a student is looking at a phone or reading a book-alongside facial expressions, we propose CLIP-CAER (CLIP-based Context-aware Academic Emotion Recognition). Our method utilizes learnable text prompts within the vision-language model CLIP to effectively integrate facial expression and context cues from videos. Experimental results demonstrate that CLIP-CAER substantially outperforms state-of-the-art video-based facial expression recognition methods, which are primarily designed for basic emotions, emphasizing the crucial role of context in accurately recognizing academic emotions. Project page: https://zgsfer.github.io/CAER

View on arXiv PDF

Similar