Towards Multimodal Emotion Recognition in German Speech Events in Cars using Transfer Learning
This work addresses emotion recognition for in-car interactions. The contribution is domain-specific and incremental: it applies existing methods to a new multimodal dataset.
The paper tackles multimodal emotion recognition in German speech events in cars by comparing off-the-shelf audio and visual tools with a neural transfer learning approach for text. Transfer learning improves F1 by up to 10 percentage points and reaches up to 76 micro-average F1 across the emotions joy, annoyance, and insecurity.
The recognition of emotions by humans is a complex process that draws on multiple interacting signals, such as facial expressions and both the prosody and the semantic content of utterances. Research on automatic emotion recognition is, with few exceptions, limited to one modality. We describe an in-car experiment for emotion recognition from speech interactions covering three modalities: the audio signal of a spoken interaction, the visual signal of the driver's face, and the manually transcribed content of the driver's utterances. We use off-the-shelf tools for emotion detection in audio and face, and compare them to a neural transfer learning approach for emotion recognition from text that utilizes existing resources from other domains. We see that transfer learning enables models based on out-of-domain corpora to perform well. This method contributes up to 10 percentage points in F1, with up to 76 micro-average F1 across the emotions joy, annoyance, and insecurity. Our findings also indicate that off-the-shelf tools analyzing face and audio are not yet ready for emotion detection in in-car speech interactions without further adjustments.
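To make the reported metric concrete, the following is a minimal sketch of how a micro-average F1 over the three emotions could be computed: true positives, false positives, and false negatives are pooled across the classes before precision and recall are taken. The function name `micro_f1`, the label strings, and the toy data are assumptions for illustration only; they are not taken from the paper, and the authors' exact evaluation setup may differ.

```python
def micro_f1(gold, pred, labels):
    """Micro-average F1: pool TP/FP/FN over all given labels, then compute F1."""
    tp = fp = fn = 0
    for label in labels:
        for g, p in zip(gold, pred):
            if p == label and g == label:
                tp += 1  # predicted this emotion and it was correct
            elif p == label and g != label:
                fp += 1  # predicted this emotion but gold differs
            elif p != label and g == label:
                fn += 1  # missed an instance of this emotion
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data; "no_emotion" is an assumed extra label that is
# excluded from the averaged classes.
EMOTIONS = ["joy", "annoyance", "insecurity"]
gold = ["joy", "annoyance", "insecurity", "joy", "no_emotion"]
pred = ["joy", "joy", "insecurity", "annoyance", "no_emotion"]

score = micro_f1(gold, pred, EMOTIONS)
print(f"micro-average F1: {score:.2f}")
```

Because the counts are pooled before averaging, frequent emotions weigh more heavily in the score than rare ones, which is the main difference from a macro average.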