CV LG May 9, 2023

Integrating Holistic and Local Information to Estimate Emotional Reaction Intensity

Yini Fang, Liang Wu, Frederic Jumelle, Bertram Shi

arXiv:2305.05534v12.83 citations Has Code

Originality Incremental advance

AI Analysis

This work addresses emotional reaction analysis for affective computing applications, representing an incremental improvement over existing methods.

The authors tackled video-based emotional reaction intensity estimation by proposing a multi-modal architecture combining video and audio information with holistic and local facial features, achieving Pearson Correlation Coefficients of 0.455 on validation and 0.4547 on test datasets in the ABAW5 competition.
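The reported metric is the Pearson Correlation Coefficient between estimated and self-reported scores, averaged across emotions. Below is a minimal sketch of that computation. It assumes predictions and targets are arrays of shape (n_videos, n_emotions). The function name `mean_pearson` is hypothetical, and this is not the challenge's official scoring script.

```python
import numpy as np


def mean_pearson(pred, target):
    """Return (mean PCC across emotions, per-emotion PCC).

    pred, target: array-likes of shape (n_videos, n_emotions).
    Assumes no column is constant; a constant column would give a zero
    denominator and an undefined correlation.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)

    # Center each emotion column around its own mean.
    pred_c = pred - pred.mean(axis=0)
    target_c = target - target.mean(axis=0)

    # Pearson correlation per emotion: covariance over product of std devs.
    numerator = (pred_c * target_c).sum(axis=0)
    denominator = np.sqrt((pred_c ** 2).sum(axis=0) * (target_c ** 2).sum(axis=0))
    per_emotion = numerator / denominator

    return per_emotion.mean(), per_emotion


# Toy data (illustrative only): column 0 is perfectly correlated,
# column 1 is perfectly anti-correlated.
toy_pred = np.array([[1.0, 3.0], [2.0, 2.0], [3.0, 1.0]])
toy_target = np.array([[2.0, 1.0], [4.0, 2.0], [6.0, 3.0]])
toy_mean, toy_per_emotion = mean_pearson(toy_pred, toy_target)
```

Averaging per-emotion coefficients, rather than pooling all scores into one correlation, keeps an emotion with a wide score range from dominating the result.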

Video-based Emotional Reaction Intensity (ERI) estimation measures the intensity of subjects' reactions to stimuli along several emotional dimensions from videos of the subject as they view the stimuli. We propose a multi-modal architecture for video-based ERI combining video and audio information. Video input is encoded spatially first, frame-by-frame, combining features encoding holistic aspects of the subjects' facial expressions and features encoding spatially localized aspects of their expressions. Input is then combined across time: from frame-to-frame using gated recurrent units (GRUs), then globally by a transformer. We handle variable video length with a regression token that accumulates information from all frames into a fixed-dimensional vector independent of video length. Audio information is handled similarly: spectral information extracted within each frame is integrated across time by a cascade of GRUs and a transformer with regression token. The video and audio regression tokens' outputs are merged by concatenation, then input to a final fully connected layer producing intensity estimates. Our architecture achieved excellent performance on the Hume-Reaction dataset in the ERI Estimation Challenge of the Fifth Competition on Affective Behavior Analysis in-the-Wild (ABAW5). The Pearson Correlation Coefficients between estimated and subject self-reported scores, averaged across all emotions, were 0.455 on the validation dataset and 0.4547 on the test dataset, well above the baselines. The transformer's self-attention mechanism enables our architecture to focus on the most critical video frames regardless of length. Ablation experiments establish the advantages of combining holistic/local features and of multi-modal integration. Code is available at https://github.com/HKUST-NISL/ABAW5.
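The regression-token readout and the concatenation-based fusion described above can be illustrated with a heavily simplified sketch. It is not the authors' implementation (see their repository for that). It omits the spatial encoders and GRUs, replaces the transformer with one single-head attention step read out at the regression token, and uses random weights. All names, the feature size `d`, and `n_emotions = 7` are assumptions made for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def regression_token_pool(frames, token, w_q, w_k, w_v):
    """Pool a variable-length sequence into one fixed-size vector.

    frames: (T, d) per-frame features; token: (d,) regression token.
    The token is prepended to the sequence and attends over all positions,
    so the output size does not depend on T.
    """
    seq = np.vstack([token[None, :], frames])       # (T + 1, d)
    query = token @ w_q                             # (d,)
    keys = seq @ w_k                                # (T + 1, d)
    values = seq @ w_v                              # (T + 1, d)
    weights = softmax(keys @ query / np.sqrt(keys.shape[1]))  # (T + 1,)
    return weights @ values, weights


def fuse_and_regress(video_vec, audio_vec, w_out, b_out):
    """Concatenate both modality vectors, then apply one linear layer."""
    fused = np.concatenate([video_vec, audio_vec])  # (2 * d,)
    return fused @ w_out + b_out                    # (n_emotions,)


rng = np.random.default_rng(0)
d = 16          # assumed per-frame feature size (illustrative)
n_emotions = 7  # assumed number of emotion dimensions (illustrative)


def random_branch():
    """Random stand-ins for the learned parameters of one modality branch."""
    scale = 1.0 / np.sqrt(d)
    return {
        "token": rng.normal(size=d),
        "w_q": rng.normal(size=(d, d)) * scale,
        "w_k": rng.normal(size=(d, d)) * scale,
        "w_v": rng.normal(size=(d, d)) * scale,
    }


video_params = random_branch()
audio_params = random_branch()
w_out = rng.normal(size=(2 * d, n_emotions)) / np.sqrt(2 * d)
b_out = np.zeros(n_emotions)


def estimate(video_frames, audio_frames):
    """Map two variable-length feature sequences to n_emotions scores."""
    video_vec, _ = regression_token_pool(video_frames, **video_params)
    audio_vec, _ = regression_token_pool(audio_frames, **audio_params)
    return fuse_and_regress(video_vec, audio_vec, w_out, b_out)


# Sequences of different lengths yield outputs of the same shape.
short_out = estimate(rng.normal(size=(5, d)), rng.normal(size=(8, d)))
long_out = estimate(rng.normal(size=(300, d)), rng.normal(size=(450, d)))
```

The attention weights returned by `regression_token_pool` indicate how strongly each frame contributes to the pooled vector, which is the mechanism the abstract credits for focusing on the most critical frames.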
