CVMar 11

Hierarchical Granularity Alignment and State Space Modeling for Robust Multimodal AU Detection in the Wild

Jun Yu, Yunxiang Zhang, Naixiang Zheng, Lingsi Zhu, Guoyuan Wang

arXiv:2603.11306v14.8h-index: 1

Predicted impact top 80% in CV · last 90 daysOriginality Incremental advance

AI Analysis

This work addresses the problem of accurate emotion analysis from facial expressions in real-world settings for applications like human-computer interaction, though it appears incremental by building on existing multimodal and foundation model approaches.

The paper tackled robust facial Action Unit detection in unconstrained environments by proposing a multimodal framework with hierarchical granularity alignment and state space modeling, achieving state-of-the-art performance on the Aff-Wild2 dataset and top rankings in a competition.

Facial Action Unit (AU) detection in in-the-wild environments remains a formidable challenge due to severe spatial-temporal heterogeneity, unconstrained poses, and complex audio-visual dependencies. While recent multimodal approaches have made progress, they often rely on capacity-limited encoders and shallow fusion mechanisms that fail to capture fine-grained semantic shifts and ultra-long temporal contexts. To bridge this gap, we propose a novel multimodal framework driven by Hierarchical Granularity Alignment and State Space Models.Specifically, we leverage powerful foundation models, namely DINOv2 and WavLM, to extract robust and high-fidelity visual and audio representations, effectively replacing traditional feature extractors. To handle extreme facial variations, our Hierarchical Granularity Alignment module dynamically aligns global facial semantics with fine-grained local active patches. Furthermore, we overcome the receptive field limitations of conventional temporal convolutional networks by introducing a Vision-Mamba architecture. This approach enables temporal modeling with O(N) linear complexity, effectively capturing ultra-long-range dynamics without performance degradation. A novel asymmetric cross-attention mechanism is also introduced to deeply synchronize paralinguistic audio cues with subtle visual movements.Extensive experiments on the challenging Aff-Wild2 dataset demonstrate that our approach significantly outperforms existing baselines, achieving state-of-the-art performance. Notably, this framework secured top rankings in the AU Detection track of the 10th Affective Behavior Analysis in-the-wild Competition.

View on arXiv PDF

Similar