Temporal-Spatial Decouple before Act: Disentangled Representation Learning for Multimodal Sentiment Analysis

Chunlei Meng, Ziyang Zhou, Lucas He, Xiaojing Du, Chun Ouyang, Zhongxue Gan

arXiv:2601.13659v12.13 citationsh-index: 7

Originality Incremental advance

AI Analysis

This work addresses a specific bottleneck in multimodal sentiment analysis for researchers, offering an incremental improvement over existing methods.

The paper tackles the problem of spatiotemporal information asymmetry in multimodal sentiment analysis by proposing TSDA, which decouples each modality into temporal and spatial components before interaction, resulting in improved performance over baselines.

Multimodal Sentiment Analysis integrates Linguistic, Visual, and Acoustic. Mainstream approaches based on modality-invariant and modality-specific factorization or on complex fusion still rely on spatiotemporal mixed modeling. This ignores spatiotemporal heterogeneity, leading to spatiotemporal information asymmetry and thus limited performance. Hence, we propose TSDA, Temporal-Spatial Decouple before Act, which explicitly decouples each modality into temporal dynamics and spatial structural context before any interaction. For every modality, a temporal encoder and a spatial encoder project signals into separate temporal and spatial body. Factor-Consistent Cross-Modal Alignment then aligns temporal features only with their temporal counterparts across modalities, and spatial features only with their spatial counterparts. Factor specific supervision and decorrelation regularization reduce cross factor leakage while preserving complementarity. A Gated Recouple module subsequently recouples the aligned streams for task. Extensive experiments show that TSDA outperforms baselines. Ablation analysis studies confirm the necessity and interpretability of the design.

View on arXiv PDF

Similar