CV · May 7

Adaptive Physical-Facial Representation Fusion via Subject-Invariant Cross-Modal Prompt Tuning for Video-Based Emotion Recognition

Xiwen Luo, Jia Li, Rencheng Song, Yu Liu, Juan Cheng

arXiv:2605.05694 · 69.5 · h-index: 9

Predicted impact: top 45% in CV · last 90 days · Originality: Incremental advance

AI Analysis

For emotion recognition researchers, this method improves cross-subject generalization by explicitly separating subject-specific and shared features, addressing a key limitation of existing multimodal fusion approaches.

The paper tackles video-based emotion recognition by fusing facial and rPPG signals. Its prompt-tuning framework, which adds a decoupled shared-specific adapter (DSSA) to a frozen ViT, outperforms strong baselines on MAHNOB-HCI and DEAP in both recognition accuracy and cross-subject generalization.

Emotion recognition from facial videos enables non-contact inference of human emotional states. Although facial expressions are widely used cues, they cannot fully reflect intrinsic affective states. Remote photoplethysmography (rPPG) provides complementary physiological information, but it is highly susceptible to noise and inter-subject variability, limiting generalization to unseen individuals. Existing multimodal methods combine facial and rPPG features, yet their fusion strategies often disrupt pretrained facial representations and lack explicit mechanisms to suppress subject-specific variations. To address these issues, we propose a subject-invariant cross-modal prompt-tuning framework for video-based emotion recognition. Specifically, rPPG waveforms are transformed into noise-robust time-frequency representations (TFRs), from which modality-complementary prompts are generated to modulate facial tokens within a frozen Vision Transformer (ViT). This design enables effective cross-modal interaction while preserving the generalizable facial representations learned by the pretrained backbone. In addition, we introduce a decoupled shared-specific adapter (DSSA) into each ViT layer to explicitly separate subject-shared and subject-specific components, thereby improving cross-subject generalization. Experiments on the MAHNOB-HCI and DEAP benchmarks demonstrate that the proposed method consistently outperforms strong baselines in both recognition accuracy and generalization ability, highlighting its effectiveness for video-based emotion recognition.
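
The data flow described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the abstract does not say which time-frequency transform is used, how prompts are generated or injected, or how the DSSA is built. Everything below is an assumption made for illustration: the magnitude STFT as the TFR, the pooled linear prompt generator (`w_prompt`), prepending prompts to the token sequence, the two-branch bottleneck adapter, and the orthogonality penalty (`ortho_penalty`). The rPPG waveform and the facial tokens are synthetic stand-ins, and all weights are random rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)

def stft_magnitude(signal, win_len=128, hop=32):
    """Magnitude STFT, a stand-in TFR (the paper's transform is not specified)."""
    window = np.hanning(win_len)
    starts = range(0, len(signal) - win_len + 1, hop)
    frames = np.stack([signal[s:s + win_len] * window for s in starts])
    return np.abs(np.fft.rfft(frames, axis=1))  # (num_frames, win_len // 2 + 1)

def bottleneck(x, w_down, w_up):
    """Low-rank adapter branch: down-project, ReLU, up-project."""
    return np.maximum(x @ w_down, 0.0) @ w_up

def init_branch(d_model, rank):
    """Random weights standing in for a trained adapter branch."""
    return (0.1 * rng.standard_normal((d_model, rank)),
            0.1 * rng.standard_normal((rank, d_model)))

# Synthetic stand-in for an rPPG waveform: 1.2 Hz pulse plus noise, 30 fps, 20 s.
fs = 30.0
t = np.arange(600) / fs
rppg = np.sin(2 * np.pi * 1.2 * t) + 0.3 * rng.standard_normal(t.size)

# Step 1: waveform -> time-frequency representation.
tfr = stft_magnitude(rppg)
freqs = np.fft.rfftfreq(128, d=1.0 / fs)
peak_hz = freqs[tfr.mean(axis=0).argmax()]  # dominant pulse frequency in the TFR

# Step 2 (hypothetical prompt generator): pool TFR frames into groups, project linearly.
d_model, num_prompts, rank = 16, 4, 4
w_prompt = 0.01 * rng.standard_normal((tfr.shape[1], d_model))
pooled = np.stack([c.mean(axis=0) for c in np.array_split(tfr, num_prompts)])
prompts = pooled @ w_prompt  # (num_prompts, d_model)

# Step 3: prepend prompts to facial tokens (random stand-in for one frozen ViT layer).
face_tokens = rng.standard_normal((10, d_model))
tokens = np.concatenate([prompts, face_tokens], axis=0)

# Step 4 (assumed adapter layout): parallel subject-shared and subject-specific branches.
h_shared = bottleneck(tokens, *init_branch(d_model, rank))
h_specific = bottleneck(tokens, *init_branch(d_model, rank))
out = tokens + h_shared  # only the shared branch is passed on (assumption)

# Assumed decoupling term: penalize correlation between the two branches.
ortho_penalty = float(np.sum((h_shared.T @ h_specific) ** 2))

print(tfr.shape, tokens.shape, out.shape)
```

In a real setup only the prompt generator and the adapter weights would be trained while the ViT backbone stays frozen, which is the property the abstract credits with preserving the pretrained facial representations.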
