Representation Decomposition for Learning Similarity and Contrastness Across Modalities for Affective Computing
This work addresses the challenge of capturing complex and conflicting evidence across modalities in affective computing. The contribution is incremental, but it improves emotion understanding for human-computer interaction.
The paper proposes a representation decomposition approach for multi-modal affective computing that separates shared and modality-specific components, and it reports consistent improvements over strong baselines and state-of-the-art models across three tasks.
Multi-modal affective computing aims to automatically recognize and interpret human attitudes from diverse data sources such as images and text, thereby enhancing human-computer interaction and emotion understanding. Existing approaches typically rely on unimodal analysis or straightforward fusion of cross-modal information, and therefore fail to capture the complex and conflicting evidence presented across different modalities. In this paper, we propose a novel LLM-based approach for affective computing that explicitly decomposes visual and textual representations into shared (modality-invariant) and modality-specific components. Specifically, our approach first encodes and aligns the input modalities using pre-trained multi-modal encoders, then employs a representation decomposition framework to separate common emotional content from unique cues, and finally integrates these decomposed signals via an attention mechanism to form a dynamic soft prompt for a multi-modal LLM. Extensive experiments on three representative affective computing tasks, namely multi-modal aspect-based sentiment analysis, multi-modal emotion analysis, and hateful meme detection, demonstrate the effectiveness of our approach, which consistently outperforms strong baselines and state-of-the-art models.
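The decompose-then-fuse step described above can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify how the decomposition or the attention is parameterized, so the sketch assumes plain linear projections (one shared projection applied to both modalities, one specific projection per modality) and a single hypothetical learned query vector that attends over the four resulting components. All names (`make_params`, `decompose_and_fuse`, `query`) are illustrative, the weights are random rather than trained, and the output is a single vector where a real soft prompt would usually be a sequence of embeddings.

```python
import math
import random


def linear(x, w):
    """Project vector x (length d_in) with matrix w (d_in rows, d_out columns)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_params(rng, d_in, d_out):
    """Random stand-ins for learned weights (assumption: linear projections)."""
    def matrix():
        return [[rng.gauss(0.0, 0.1) for _ in range(d_out)] for _ in range(d_in)]
    return {
        "shared": matrix(),           # same weights for both modalities
        "visual_specific": matrix(),  # visual-only projection
        "text_specific": matrix(),    # text-only projection
        "query": [rng.gauss(0.0, 1.0) for _ in range(d_out)],  # hypothetical query
    }


def decompose_and_fuse(visual, text, params):
    """Split each modality into shared/specific parts, then fuse with attention."""
    components = [
        linear(visual, params["shared"]),           # shared view of the image
        linear(text, params["shared"]),             # shared view of the text
        linear(visual, params["visual_specific"]),  # image-specific cues
        linear(text, params["text_specific"]),      # text-specific cues
    ]
    query = params["query"]
    d = len(query)
    # Scaled dot-product score of the query against each component.
    scores = [sum(q * c for q, c in zip(query, comp)) / math.sqrt(d)
              for comp in components]
    weights = softmax(scores)
    # Attention-weighted sum of the components: the (single-vector) soft prompt.
    soft_prompt = [sum(w * comp[j] for w, comp in zip(weights, components))
                   for j in range(d)]
    return soft_prompt, weights


if __name__ == "__main__":
    rng = random.Random(0)
    params = make_params(rng, d_in=6, d_out=4)
    visual = [rng.gauss(0.0, 1.0) for _ in range(6)]  # stand-in for encoder output
    text = [rng.gauss(0.0, 1.0) for _ in range(6)]    # stand-in for encoder output
    prompt, weights = decompose_and_fuse(visual, text, params)
    print("attention weights:", weights)
    print("soft prompt:", prompt)
```

Because the attention weights depend on the input, the fused vector changes from example to example, which is the sense in which the prompt is "dynamic"; a trained system would additionally need objectives that push the shared components together and keep the specific ones distinct, which this sketch omits.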