LBVCNN: Local Binary Volume Convolutional Neural Network for Facial Expression Recognition from Image Sequences
This work addresses facial expression recognition for computer vision applications, presenting an incremental improvement by introducing a parameter-efficient 3D CNN without relying on facial landmarks.
The authors tackled facial expression recognition from image sequences by proposing a new 3D CNN with a Local Binary Volume layer, achieving results comparable to state-of-the-art models on the CK+, Oulu-CASIA, and UNBC McMaster datasets while using 27 times fewer trainable parameters than a conventional 3x3x3 3D convolutional layer.
Recognizing facial expressions is one of the central problems in computer vision. Temporal image sequences carry spatio-temporal features that are useful for recognizing expressions. In this paper, we propose a new 3D Convolutional Neural Network (CNN) that can be trained end-to-end for facial expression recognition on temporal image sequences without using facial landmarks. More specifically, we propose a novel 3D convolutional layer that we call the Local Binary Volume (LBV) layer. The LBV layer, when used with our newly proposed LBVCNN network, achieves results comparable to state-of-the-art models, both landmark-based and landmark-free, on image sequences from the CK+, Oulu-CASIA, and UNBC McMaster shoulder pain datasets. Furthermore, our LBV layer significantly reduces the number of trainable parameters relative to a conventional 3D convolutional layer: compared to a conventional 3x3x3 3D convolutional layer, the LBV layer uses 27 times fewer trainable parameters.
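
The 27x figure can be checked with simple arithmetic. The sketch below is illustrative only and rests on an assumption the abstract does not state: that the trainable weights of an LBV-style layer amount to one weight per (input channel, output channel) pair, as in a 1x1x1 convolution, with any 3x3x3 filters held fixed. Bias terms are ignored, and the function names are hypothetical.

```python
def conv3d_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Trainable weights of a conventional k x k x k 3D convolution (no bias)."""
    return c_in * c_out * k ** 3


def pointwise_params(c_in: int, c_out: int) -> int:
    """Trainable weights under the assumption stated above:
    one weight per (input channel, output channel) pair, no bias."""
    return c_in * c_out


def reduction_factor(c_in: int, c_out: int, k: int = 3) -> float:
    """Ratio of conventional to assumed LBV-style trainable weights."""
    return conv3d_params(c_in, c_out, k) / pointwise_params(c_in, c_out)


if __name__ == "__main__":
    # Example channel counts chosen arbitrarily for illustration.
    print(conv3d_params(64, 128))     # weights in a conventional 3x3x3 layer
    print(pointwise_params(64, 128))  # weights under the stated assumption
    print(reduction_factor(64, 128))  # ratio; equals k**3 for any channel counts
```

Under this assumption the ratio is k^3 regardless of the channel counts, which gives 27 for a 3x3x3 kernel and matches the reduction reported above.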