Upper, Middle and Lower Region Learning for Facial Action Unit Detection
This work addresses the challenge of balancing region-specific focus with information retention in facial expression analysis, offering an incremental improvement for computer vision applications.
The paper tackles facial action unit detection by dividing the face into three broad regions and using a novel deep learning framework, achieving the highest F1 scores for AU1, AU2, and AU4 and the highest overall accuracy on the DISFA dataset compared to state-of-the-art methods.
Facial action unit (AU) detection is fundamental to facial expression analysis. Because an AU occurs only in a small area of the face, region-based learning is widely recognized as useful for AU detection. Most region-based studies focus on the small region where the AU occurs. Focusing on a specific region helps eliminate the influence of identity, but it risks losing information, and finding the right balance is challenging. In this study, I propose a simple strategy: I divide the face into three broad regions (upper, middle, and lower) and group AUs according to where they occur. I propose a new end-to-end deep learning framework named the three regions based attention network (TRA-Net). After extracting the global feature, TRA-Net uses a hard attention module to extract three feature maps, each of which contains only one specific region. Each region-specific feature map is fed to an independent branch. In each branch, three consecutive soft attention modules extract higher-level features for final AU detection. On the DISFA dataset, this model achieves the highest F1 scores for the detection of AU1, AU2, and AU4, and produces the highest accuracy in comparison with the state-of-the-art methods.
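The hard attention step described in the abstract can be illustrated with a minimal sketch. It is not the TRA-Net implementation: the equal-thirds row split, the single-channel feature map, and the names `region_rows`, `hard_attention`, and `split_regions` are all assumptions made for illustration, since the abstract does not state the region boundaries.

```python
from typing import Dict, List, Tuple

# Single-channel H x W feature map, kept as nested lists for brevity.
FeatureMap = List[List[float]]


def region_rows(height: int) -> Dict[str, Tuple[int, int]]:
    """Return [start, stop) row ranges for the three face regions.

    Assumption: the face is split into equal horizontal thirds. The
    abstract does not give the actual boundaries.
    """
    a, b = height // 3, 2 * height // 3
    return {"upper": (0, a), "middle": (a, b), "lower": (b, height)}


def hard_attention(fmap: FeatureMap, rows: Tuple[int, int]) -> FeatureMap:
    """Keep the rows inside [start, stop) and zero out every other row."""
    start, stop = rows
    return [
        list(row) if start <= i < stop else [0.0] * len(row)
        for i, row in enumerate(fmap)
    ]


def split_regions(fmap: FeatureMap) -> Dict[str, FeatureMap]:
    """Produce one masked copy of the feature map per region.

    Each copy would then feed its own independent branch.
    """
    bounds = region_rows(len(fmap))
    return {name: hard_attention(fmap, rows) for name, rows in bounds.items()}


if __name__ == "__main__":
    global_feature = [[1.0, 1.0] for _ in range(6)]
    for name, masked in split_regions(global_feature).items():
        print(name, masked)
```

Under these assumptions each masked map has the same shape as the global feature, so the three branches can share one architecture while seeing only their own region.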