MM · May 29

Dynamic Interaction-Aware and Causality-Disentangled Framework for Multimodal Sentiment Analysis

Guangyuan Dong, Ziwei Hong, Shenghao Liu, Chenyu Wu, Yuanyuan Fang, Zihao Li, Xudong Zhang, Bingchen Liu, Yuchen Zhang, Haitao Ding, Zhenzhou Zhou, Ziyu Song

arXiv:2605.30994 · 72.6 · h-index: 6

AI Analysis

This work provides an incremental improvement for researchers and practitioners working on multimodal sentiment analysis, specifically by enhancing the handling of dynamic interactions and language bias.

This paper addresses challenges in Multimodal Sentiment Analysis (MSA) by proposing a framework that dynamically adapts to variations across samples and disentangles inherent sentimental bias from language features. The framework achieves new state-of-the-art results on the CMU-MOSI and CMU-MOSEI benchmarks, with Acc-2/F1 scores of 86.52%/86.51% on MOSI and 86.72%/86.65% on MOSEI.
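For readers unfamiliar with the reported metrics, the sketch below shows one common way Acc-2 and F1 are computed on CMU-MOSI-style data, where each sample carries a continuous sentiment label. It assumes the negative/positive convention (samples with a label of exactly zero are dropped) and a support-weighted F1; the summary does not say which convention the paper uses, and the function names and sample values are illustrative only.

```python
def binarize(scores, labels):
    """Drop samples whose gold label is exactly 0, then map to 1 (positive) / 0 (negative)."""
    kept = [(s, y) for s, y in zip(scores, labels) if y != 0]
    preds = [int(s > 0) for s, _ in kept]
    truth = [int(y > 0) for _, y in kept]
    return preds, truth


def acc2(preds, truth):
    """Binary accuracy: fraction of samples whose predicted polarity matches the gold polarity."""
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)


def f1_for_class(preds, truth, cls):
    """F1 score treating `cls` as the positive class."""
    tp = sum(p == cls and t == cls for p, t in zip(preds, truth))
    fp = sum(p == cls and t != cls for p, t in zip(preds, truth))
    fn = sum(p != cls and t == cls for p, t in zip(preds, truth))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_f1(preds, truth):
    """Average of per-class F1, weighted by how often each class occurs in the gold labels."""
    n = len(truth)
    return sum(f1_for_class(preds, truth, c) * truth.count(c) / n for c in (0, 1))


# Made-up model outputs and gold labels on the [-3, 3] sentiment scale.
scores = [1.2, -0.4, 0.3, -2.1, 0.8, 0.0]
labels = [2.0, -1.0, -0.6, -3.0, 1.4, 0.0]

preds, truth = binarize(scores, labels)
print(f"Acc-2: {acc2(preds, truth):.2%}  F1: {weighted_f1(preds, truth):.2%}")
```

Under the alternative negative/non-negative convention, zero-labelled samples would be kept and counted as non-negative, which generally yields slightly different numbers.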

Although Multimodal Sentiment Analysis (MSA) effectively leverages rich information from language, visual, and acoustic modalities, existing methods still face two core challenges: 1) static conflict suppression mechanisms fail to adapt to dynamic variations across samples, and 2) the inherent sentimental bias within the language modality, which can misguide learning from other modalities, remains entangled. To this end, we propose a Dynamic Multimodal Causal Disentanglement and Adaptive Fusion Framework (MCAF). Its cornerstones are a Multi-Granularity Causal Dynamic Router and a Conditional Diffusion Denoising Module. First, we introduce a causal intervention module based on the information bottleneck principle, which builds a Structural Causal Model to disentangle sentimental bias from language features, yielding a "de-confounded" language representation as a pure guiding signal. Second, we devise a Dynamic Multimodal Router that evaluates the interaction states (complementary, conflicting, or redundant) among visual, acoustic, and de-confounded language signals in real time across three levels (feature, temporal, and modality), and then adaptively allocates weights and routes information flow for fine-grained regulation. Finally, a lightweight Conditional Diffusion Denoising Module performs iterative denoising on the fused joint representation to explicitly filter out residual irrelevant information, generating a robust hyper-modality representation. Extensive experiments on the CMU-MOSI and CMU-MOSEI benchmarks show that MCAF sets a new state of the art on key classification metrics, achieving an Acc-2/F1 of 86.52%/86.51% on MOSI and 86.72%/86.65% on MOSEI, while remaining highly competitive on the other metrics. Comprehensive analyses and visualizations further validate its efficacy in dynamically perceiving interactions, disentangling bias, and enhancing interpretability.
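To make the idea of sample-wise routing concrete, here is a minimal sketch of modality-level weighting only. It is not the paper's router: the abstract gives no equations, so the use of cosine similarity as an agreement score, the softmax temperature, the additive fusion, and every function name below are assumptions chosen for illustration. The point is simply that the weights are recomputed for each sample rather than fixed.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def softmax(xs, temperature=1.0):
    """Numerically stable softmax; a lower temperature gives sharper weights."""
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(language, visual, acoustic, temperature=0.5):
    """Weight visual and acoustic features by their agreement with the language guide.

    Hypothetical stand-in for a dynamic router: a modality that agrees with the
    guiding signal for this sample gets more weight, a conflicting one gets less.
    """
    scores = [cosine(language, visual), cosine(language, acoustic)]
    w_visual, w_acoustic = softmax(scores, temperature)
    fused = [
        l + w_visual * v + w_acoustic * a
        for l, v, a in zip(language, visual, acoustic)
    ]
    return (w_visual, w_acoustic), fused


# Toy sample: visual agrees with the language guide, acoustic conflicts with it.
weights, fused = route(language=[1.0, 0.0], visual=[1.0, 0.0], acoustic=[-1.0, 0.0])
print("weights (visual, acoustic):", weights)
print("fused representation:", fused)
```

A fuller version in the spirit of the abstract would apply the same kind of scoring at the feature and temporal levels as well, and would learn the scoring function rather than hard-coding a similarity measure.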
