CV, IV | Jul 12, 2021

Spatial and Temporal Networks for Facial Expression Recognition in the Wild Videos

arXiv:2107.05160v1 | 9 citations
Originality: Incremental advance
AI Analysis

This work addresses the problem of classifying basic expressions in diverse, real-world video scenarios for affective behavior analysis, representing an incremental improvement through ensemble methods.

The paper tackles facial expression recognition in in-the-wild videos by proposing an ensemble of CNN, CNN-RNN, and CNN-Transformer models that incorporates both spatial and temporal information, achieving an F1 score of 0.4133, an accuracy of 0.6216, and a final metric of 0.4821 on the validation set.
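The summary reports a "final metric" without saying how it is computed. A minimal sketch, assuming the weighting commonly cited for the ABAW 2021 expression track (0.67 × F1 + 0.33 × accuracy), shows how the three reported numbers relate. The function name `final_metric` and the `f1_weight` parameter are illustrative, not taken from the paper.

```python
def final_metric(f1: float, accuracy: float, f1_weight: float = 0.67) -> float:
    """Weighted combination of F1 and accuracy.

    ASSUMPTION: the 0.67 / 0.33 weighting is the one commonly cited for the
    ABAW 2021 expression track; it is not stated in this summary.
    """
    return f1_weight * f1 + (1.0 - f1_weight) * accuracy


# Values reported for the validation set in the summary above.
reported_f1 = 0.4133
reported_accuracy = 0.6216
reported_final = 0.4821

score = final_metric(reported_f1, reported_accuracy)
print(f"recomputed final metric: {score:.4f} (reported: {reported_final})")
```

Under this assumption the recomputed value agrees with the reported 0.4821 to within rounding of the inputs, which suggests the reported figures are mutually consistent.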

This paper describes our proposed methodology for the seven basic expression classification track of the Affective Behavior Analysis in-the-wild (ABAW) Competition 2021. In this task, facial expression recognition (FER) methods aim to classify the correct expression category against diverse backgrounds, which poses several challenges. First, to adapt the model to in-the-wild scenarios, we use knowledge from pre-training on large-scale face recognition data. Second, we propose an ensemble model with a convolutional neural network (CNN), a CNN-recurrent neural network (CNN-RNN), and a CNN-Transformer to incorporate both spatial and temporal information. Our ensemble model achieved an F1 score of 0.4133, an accuracy of 0.6216, and a final metric of 0.4821 on the validation set.
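The abstract does not say how the three models' outputs are combined. One common choice, shown here purely as an assumption, is to average the per-class softmax probabilities and take the highest-scoring class. The label order, the example logits, and the names `softmax` and `ensemble_predict` are all hypothetical.

```python
import math

# ASSUMPTION: illustrative label order for the seven basic expression classes.
EXPRESSIONS = ["Neutral", "Anger", "Disgust", "Fear",
               "Happiness", "Sadness", "Surprise"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(per_model_logits, weights=None):
    """Average per-class probabilities across models (soft voting).

    per_model_logits: one list of class logits per model.
    weights: optional per-model weights summing to 1; uniform if omitted.
    Returns (index of the predicted class, averaged probabilities).
    """
    n_models = len(per_model_logits)
    if weights is None:
        weights = [1.0 / n_models] * n_models
    probs = [softmax(logits) for logits in per_model_logits]
    n_classes = len(probs[0])
    avg = [sum(w * p[c] for w, p in zip(weights, probs))
           for c in range(n_classes)]
    best = max(range(n_classes), key=avg.__getitem__)
    return best, avg


# Made-up logits for a single frame from the three model families.
cnn_logits = [0.0, 0.0, 0.0, 0.0, 3.0, 0.0, 0.0]          # favors Happiness
cnn_rnn_logits = [0.0, 0.0, 0.0, 0.0, 3.0, 0.0, 0.0]      # favors Happiness
cnn_transformer_logits = [3.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # favors Neutral

label_idx, avg_probs = ensemble_predict(
    [cnn_logits, cnn_rnn_logits, cnn_transformer_logits]
)
print(EXPRESSIONS[label_idx])
```

Soft voting lets a confident model outweigh two uncertain ones, whereas majority voting over hard labels would not; which scheme the authors actually used is not stated here.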

Code Implementations: 1 repo
