"I'm Not Mad, Just Focused": Understanding Human Emotions in Human-Robot Collaboration
This work addresses the need for more accurate emotion recognition in human-robot collaboration, which is important for improving robot social intelligence and user experience.
The paper proposes a vision language model (VLM)-based emotion recognition system for human-robot collaboration that uses contextual understanding to improve emotion interpretation. In evaluations, the system achieved higher semantic similarity and positive sentiment alignment with human annotations than a baseline CNN, and participants preferred the emotion-adaptive robot behaviour facilitated by the system.
Human-robot collaboration (HRC) can benefit from robots' abilities to interpret human emotional states. However, current emotion recognition (ER) models in HRC often fall short, particularly due to their reliance on acted datasets and single-modality inputs like facial expressions. We propose a novel vision language model (VLM)-based ER system that leverages contextual understanding to improve emotion interpretation in HRC. We first evaluate the VLM-ER system by assessing its semantic and sentiment similarity with human annotations on an existing HRC dataset. Then, in a user study with a service robot in a collaborative delivery task, we evaluate the effects of modulating the robot's behaviour based on the user's emotional state inferred by the VLM-ER system. The results show that the proposed VLM-ER system achieves higher semantic similarity and positive sentiment alignment with human annotations compared to a baseline convolutional neural network-based system. Further, participants in the user study preferred emotion-adaptive robot behaviour facilitated by the VLM-ER system.
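As an illustration of the first evaluation step, the sketch below scores how closely a model's emotion description matches a human annotation using cosine similarity. It is a minimal sketch under stated assumptions: the abstract does not specify the similarity metric or the embedding model, so the bag-of-words vectors, the function names (`bag_of_words`, `cosine_similarity`, `mean_similarity`), and any example strings are hypothetical stand-ins, not the authors' method.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Toy stand-in for a sentence embedding (assumption): lowercase token counts."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors; 0.0 if either is empty."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_similarity(predictions: list[str], annotations: list[str]) -> float:
    """Average similarity over paired model outputs and human annotations."""
    if len(predictions) != len(annotations):
        raise ValueError("predictions and annotations must be paired one-to-one")
    if not predictions:
        return 0.0
    scores = [
        cosine_similarity(bag_of_words(p), bag_of_words(h))
        for p, h in zip(predictions, annotations)
    ]
    return sum(scores) / len(scores)
```

In practice, a learned sentence-embedding model would likely replace `bag_of_words`, so that descriptions such as "focused" and "concentrating" score as similar even though they share no tokens. The same pairing-and-averaging structure would apply to a sentiment comparison, with a sentiment score in place of the embedding.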