Class-attention Video Transformer for Engagement Intensity Prediction
This work addresses engagement intensity prediction for educational video analysis. It is an incremental improvement over prior work, contributing new components for video processing and data augmentation.
The paper predicts student engagement intensity from videos. It proposes CavT, a class-attention video transformer that handles variable-length videos, and BorS, a sampling method for data augmentation. Together they achieve state-of-the-art MSE scores of 0.0495 on the EmotiW-EP dataset and 0.0377 on the DAiSEE dataset.
To handle variable-length long videos, prior works extract multi-modal features and fuse them to predict students' engagement intensity. In this paper, we present a new end-to-end method, Class Attention in Video Transformer (CavT), which uses a single class-embedding vector to perform end-to-end learning uniformly on both variable-length long videos and fixed-length short videos. Furthermore, to address the lack of sufficient training samples, we propose a binary-order representatives sampling method (BorS), which augments the training set with multiple video sequences drawn from each video. BorS+CavT achieves the state-of-the-art MSE on both the EmotiW-EP dataset (0.0495) and the DAiSEE dataset (0.0377). The code and models are publicly available at https://github.com/mountainai/cavt.
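The idea of a single class vector summarizing videos of different lengths can be illustrated with a minimal single-head class-attention sketch in NumPy. This is not the CavT implementation from the repository. The function name `class_attention`, the weight matrices, the feature dimension, and the random inputs are all hypothetical, and the formulation (the class vector as the only query, attending over itself and all frame tokens) is an assumption based on generic class attention.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def class_attention(cls, tokens, w_q, w_k, w_v):
    """Update one class vector by attending over [cls; tokens].

    cls:    (d,)   class embedding (hypothetical, randomly initialised here)
    tokens: (n, d) per-frame features; n may differ from video to video
    Returns the updated class vector (d,) and the attention weights (n + 1,).
    """
    z = np.vstack([cls[None, :], tokens])   # (n + 1, d)
    q = cls @ w_q                           # only the class vector forms a query
    k = z @ w_k                             # keys for class vector and all tokens
    v = z @ w_v                             # values for class vector and all tokens
    weights = softmax(k @ q / np.sqrt(q.shape[0]))
    return weights @ v, weights

rng = np.random.default_rng(0)
d = 8                                       # assumed feature dimension
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
cls = rng.standard_normal(d)

short_video = rng.standard_normal((5, d))   # 5 frame tokens
long_video = rng.standard_normal((40, d))   # 40 frame tokens

out_short, w_short = class_attention(cls, short_video, w_q, w_k, w_v)
out_long, w_long = class_attention(cls, long_video, w_q, w_k, w_v)
print(out_short.shape, out_long.shape)
```

Because the class vector is the only query, the output has the same size whatever the number of frame tokens, so a single regression head could, in principle, consume it for both short and long inputs.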