CV May 3, 2024

Enhancing Micro Gesture Recognition for Emotion Understanding via Context-aware Visual-Text Contrastive Learning

arXiv:2405.01885v111.321 citations h-index: 6 Has Code IEEE Signal Processing Letters

Originality Incremental advance

AI Analysis

This work addresses emotion understanding through nonverbal gestures, which is important for applications such as human-computer interaction. The advance is incremental: it builds on existing contrastive learning approaches by adding text and adaptive prompts.

The paper tackles micro gesture recognition (MGR) for emotion understanding by proposing a visual-text contrastive learning method with adaptive prompting. It reports state-of-the-art performance on two public datasets and shows that feeding the textual results of MGR into emotion understanding improves performance by over 6% compared with using video directly.

Psychological studies have shown that Micro Gestures (MG) are closely linked to human emotions. MG-based emotion understanding has attracted much attention because it allows for emotion understanding through nonverbal body gestures without relying on identity information (e.g., facial and electrocardiogram data). Therefore, it is essential to recognize MG effectively for advanced emotion understanding. However, existing Micro Gesture Recognition (MGR) methods utilize only a single modality (e.g., RGB or skeleton) while overlooking crucial textual information. In this letter, we propose a simple but effective visual-text contrastive learning solution that utilizes text information for MGR. In addition, instead of using handcrafted prompts for visual-text contrastive learning, we propose a novel module called Adaptive prompting to generate context-aware prompts. The experimental results show that the proposed method achieves state-of-the-art performance on two public datasets. Furthermore, based on an empirical study utilizing the results of MGR for emotion understanding, we demonstrate that using the textual results of MGR significantly improves performance by 6%+ compared to directly using video as input.
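For readers unfamiliar with visual-text contrastive learning, the following is a minimal sketch of a generic CLIP-style symmetric contrastive loss between video and text embeddings. It is not the authors' implementation: the abstract does not specify the loss, the encoders, or the Adaptive prompting module, so the function names, the temperature value of 0.07, and the assumption that row i of each batch forms one matched video-text pair are all illustrative choices.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against a target index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; video_embs[i] is paired with text_embs[i].

    Hypothetical helper for illustration only. The temperature is an
    assumed value, not one taken from the paper.
    """
    v = [l2_normalize(e) for e in video_embs]
    t = [l2_normalize(e) for e in text_embs]
    n = len(v)
    # Cosine similarity between every video and every text, scaled by temperature.
    sim = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Video-to-text: each video should pick out its own text.
    v2t = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    # Text-to-video: each text should pick out its own video.
    t2v = sum(
        cross_entropy_row([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (v2t + t2v)


# Toy usage: matched pairs should score a lower loss than swapped pairs.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

In a real MGR batch, several clips may share the same gesture label, so treating only the diagonal as positive is a simplification of this sketch.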

