Achieving 3D Attention via Triplet Squeeze and Excitation Block
This work targets higher accuracy in facial expression recognition, but its contribution is incremental: it builds on existing CNN architectures such as ConvNeXt.
The authors address the problem of improving CNN-based models for vision tasks, particularly facial expression recognition. They propose a new attention mechanism, TripSE, that combines Triplet attention with Squeeze-and-Excitation, and report state-of-the-art results with 78.27% accuracy on the FER2013 dataset.
The emergence of ConvNeXt and its variants has reaffirmed the conceptual and structural suitability of CNN-based models for vision tasks, re-establishing them as key players in image classification in general, and in facial expression recognition (FER) in particular. In this paper, we propose a set of models that build on these advancements by incorporating a new attention mechanism that combines Triplet attention with Squeeze-and-Excitation (TripSE), realized in four different variants. We apply these variants to the ResNet18, DenseNet and ConvNeXt architectures to validate their versatility and impact. Our study shows that incorporating a TripSE block in these CNN models boosts their performance, particularly for the ConvNeXt architecture, indicating its utility. We evaluate the proposed mechanisms and associated models on four datasets, namely CIFAR100, ImageNet, FER2013 and AffectNet, where ConvNeXt with TripSE achieves a state-of-the-art accuracy of \textbf{78.27\%} on the popular FER2013 dataset.
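To make the idea of combining the two mechanisms concrete, the following is a minimal NumPy sketch of one possible composition. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify how the four TripSE variants are wired. The sketch assumes a single feature map of shape (C, H, W), omits batch normalization, and applies a Squeeze-and-Excitation channel gate to the averaged output of three Triplet-attention-style branches. The names `trip_se`, `triplet_branch`, `se_gate`, `z_pool` and `conv_same`, the reduction ratio `r`, and the random weights are all hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_gate(x, w1, w2):
    """Squeeze-and-Excitation: per-channel gate for x of shape (C, H, W)."""
    s = x.mean(axis=(1, 2))        # squeeze: global average pool -> (C,)
    h = np.maximum(w1 @ s, 0.0)    # bottleneck FC + ReLU -> (C // r,)
    return sigmoid(w2 @ h)         # expand FC + sigmoid -> (C,)

def z_pool(x):
    """Stack max and mean over the leading axis: (D, A, B) -> (2, A, B)."""
    return np.stack([x.max(axis=0), x.mean(axis=0)])

def conv_same(x, k):
    """Zero-padded 'same' cross-correlation: x (2, A, B), k (2, K, K), K odd."""
    p = k.shape[-1] // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    win = np.lib.stride_tricks.sliding_window_view(xp, k.shape[1:], axis=(1, 2))
    return np.einsum("cabij,cij->ab", win, k)

def triplet_branch(x, k, perm):
    """Permute axes, Z-pool the leading axis, conv + sigmoid, gate, un-permute."""
    xp = np.transpose(x, perm)
    gate = sigmoid(conv_same(z_pool(xp), k))      # (A, B), broadcast over axis 0
    return np.transpose(xp * gate, np.argsort(perm))

def trip_se(x, kernels, w1, w2):
    """ASSUMED composition: average three triplet branches, then apply an SE gate."""
    perms = [(0, 1, 2), (1, 0, 2), (2, 1, 0)]     # pool over C, H, W in turn
    y = sum(triplet_branch(x, k, p) for k, p in zip(kernels, perms)) / 3.0
    return y * se_gate(y, w1, w2)[:, None, None]

# Toy usage with random (untrained) weights; all sizes are illustrative only.
rng = np.random.default_rng(0)
C, H, W, r = 8, 6, 6, 4
x = rng.standard_normal((C, H, W))
kernels = [0.1 * rng.standard_normal((2, 7, 7)) for _ in range(3)]
w1 = 0.1 * rng.standard_normal((C // r, C))
w2 = 0.1 * rng.standard_normal((C, C // r))
out = trip_se(x, kernels, w1, w2)
```

In this sketch the three branches capture interactions between the (H, W), (C, W) and (H, C) axis pairs, and the SE gate then rescales each channel using global context. Because every gate lies in (0, 1), the output has the same shape as the input and never exceeds it in magnitude at any position. Other orderings, such as applying the SE gate inside each branch, would be equally plausible readings of the abstract.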