Solution for 8th Competition on Affective & Behavior Analysis in-the-wild
This work addresses facial expression analysis for affective computing applications, but appears incremental as it builds on existing datasets and models.
The paper tackles the problem of robust and accurate facial action unit detection in-the-wild by introducing an audio-visual multimodal method, and reports enhanced accuracy on the Aff-Wild2 dataset.
In this report, we present our solution for the Action Unit (AU) Detection Challenge of the 8th Competition on Affective & Behavior Analysis in-the-wild. To achieve robust and accurate classification of facial action units under in-the-wild conditions, we introduce an innovative method that leverages audio-visual multimodal data. Our method employs ConvNeXt as the image encoder and uses Whisper to extract Mel spectrogram features. A Transformer encoder-based feature fusion module then integrates the affective information embedded in the audio and image features. The fused representation provides rich high-dimensional features to the subsequent multilayer perceptron (MLP), which is trained on the Aff-Wild2 dataset, enhancing the accuracy of AU detection.
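The fusion stage described above can be illustrated with a minimal PyTorch sketch. It is not the authors' implementation: the class name `AudioVisualFusion`, the feature widths (768 for image tokens, 512 for audio tokens), the model width, the layer and head counts, mean pooling, the learned modality embeddings, and the 12 AU outputs are all assumptions chosen for illustration. Random tensors stand in for the ConvNeXt and Whisper features so the block runs without pretrained weights.

```python
import torch
import torch.nn as nn


class AudioVisualFusion(nn.Module):
    """Hypothetical fusion head: project both modalities to a shared width,
    fuse the joint token sequence with a Transformer encoder, then classify
    action units with a small MLP. All sizes are illustrative assumptions."""

    def __init__(self, img_dim=768, aud_dim=512, d_model=256, num_aus=12):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, d_model)
        self.aud_proj = nn.Linear(aud_dim, d_model)
        # Learned embeddings marking which modality each token came from
        # (an assumption; the report does not describe this detail).
        self.modality_embed = nn.Parameter(torch.zeros(2, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, dim_feedforward=512, batch_first=True
        )
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 128), nn.ReLU(), nn.Linear(128, num_aus)
        )

    def forward(self, img_feats, aud_feats):
        # img_feats: (batch, T_img, img_dim); aud_feats: (batch, T_aud, aud_dim)
        img_tokens = self.img_proj(img_feats) + self.modality_embed[0]
        aud_tokens = self.aud_proj(aud_feats) + self.modality_embed[1]
        # Concatenate along the sequence axis so self-attention can mix
        # information across the two modalities.
        tokens = torch.cat([img_tokens, aud_tokens], dim=1)
        fused = self.fusion(tokens)
        pooled = fused.mean(dim=1)  # average over all tokens
        return self.mlp(pooled)     # (batch, num_aus) logits


# Stand-in features: 2 clips, 8 image tokens and 20 audio tokens each.
model = AudioVisualFusion().eval()
with torch.no_grad():
    logits = model(torch.randn(2, 8, 768), torch.randn(2, 20, 512))
    probs = torch.sigmoid(logits)  # independent per-AU probabilities
```

Because several AUs can be active in the same frame, the sketch treats detection as multi-label classification and applies a sigmoid to each output rather than a softmax across outputs. Training such a head would typically use a per-AU binary cross-entropy loss, although the report does not state which loss was used.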