Data Augmentation for Human Behavior Analysis in Multi-Person Conversations
This work aims to improve the accuracy of multi-modal behavior analysis, a problem relevant to researchers and practitioners in multimedia and human-computer interaction. The contribution is incremental: it applies data augmentation to existing methods.
The paper tackles human behavior analysis in multi-person conversations by applying data augmentation strategies to a Swin Transformer baseline for three tasks. It achieves the best results of 0.6262 mean average precision for bodily behavior recognition and 0.7771 accuracy for eye contact detection, and a comparable result of 0.5281 unweighted average recall for next speaker prediction.
In this paper, we present the solution of our team HFUT-VUT for the MultiMediate Grand Challenge 2023 at ACM Multimedia 2023. The solution covers three sub-challenges: bodily behavior recognition, eye contact detection, and next speaker prediction. We select Swin Transformer as the baseline and exploit data augmentation strategies to address these three tasks. Specifically, we crop the raw video to remove noise from other parts of the frame, and we apply data augmentation to improve the generalization of the model. As a result, our solution achieves the best results on the corresponding test sets: a mean average precision of 0.6262 for bodily behavior recognition and an accuracy of 0.7771 for eye contact detection. In addition, our approach achieves a comparable result of 0.5281 for next speaker prediction in terms of unweighted average recall.
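For readers unfamiliar with the three metrics quoted above, the following is a minimal sketch of how accuracy, unweighted average recall, and mean average precision are commonly computed. It is not the challenge's official evaluation code: the function names, the list-based inputs, and the choice to average precision at each positive in the ranking are assumptions made for illustration.

```python
# Illustrative metric definitions (assumed, not the official challenge scripts).

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally regardless of size."""
    recalls = []
    for c in sorted(set(y_true)):
        support = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / support)
    return sum(recalls) / len(recalls)


def average_precision(labels, scores):
    """Average of precision values taken at the rank of each positive sample."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precision_sum = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(label_matrix, score_matrix):
    """Macro-average of per-class average precision for multi-label predictions.

    Both arguments are lists of rows, one row per sample and one column per class.
    """
    n_classes = len(label_matrix[0])
    per_class = [
        average_precision(
            [row[k] for row in label_matrix],
            [row[k] for row in score_matrix],
        )
        for k in range(n_classes)
    ]
    return sum(per_class) / n_classes
```

Unweighted average recall differs from accuracy when classes are imbalanced: a model that always predicts the majority class can score high accuracy while its unweighted average recall stays at one over the number of classes.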