3 Papers

24.0CVJun 2
IDO: Incongruity-aware Distribution Optimization for Multimodal Fake News Detection

Hengyang Zhou, Rongman Hong, Yuxuan Zhou et al.

Multimodal fake news detection aims to identify the authenticity of news. Existing multimodal fake news detection methods mainly focus on cross-modal consistency, but often fail to explicitly model the semantic incongruity that characterizes deceptive multimodal content. However, misinformation often contains semantic information incongruity with the facts. To address these challenges, we propose Incongruity-aware Distribution Optimization (IDO) to improve the performance of fake news detection from the perspectives of factual incongruity and modality incongruity. For factual incongruity, we introduce a channel-wise reweighting strategy to obtain semantically discriminative embeddings and utilize gaussian distribution to model the uncertain correlation caused by factual incongruity. For modality incongruity, we utilize incongruity contrastive learning to learn cross-modal semantic information. Experiments demonstrate that IDO achieves state-of-the-art performance.

63.7MMMay 28
State-Anchored Complete-View Distillation for Robust Conversational Multimodal Emotion Recognition

Zhaoyan Pan, Xiangdong Li, Wenke Wu et al.

Conversational multimodal emotion recognition (MER) requires reliable prediction when language, acoustic, or visual observations are missing or unreliable. Many missing-modality methods reconstruct absent inputs, yet such recovery can be non-unique in dialogue context, and nonverbal cues may conflict with the target utterance. To this end, we propose CoRe-KD (Complete-view Reference-guided Knowledge Distillation), a state-anchored, conflict-regularized complete-view distillation framework for robust conversational MER. A complete-view teacher provides structured references, including prediction-level references, fused states, and modality-specific states. Complete-view State Anchoring (CSA) aligns incomplete-view student predictions and states with these references, while Nonverbal Conflict Exposure (NCE) trains on target-preserving nonverbal conflict views to reduce donor-label bias. Experiments on IEMOCAP and MELD, with CMU-MOSEI as a supplementary utterance-level check, show consistent gains under fixed- and random-missing protocols. Comprehensive ablation studies and further analyses support the role of CSA and the complementary effect of NCE.

48.7MMApr 28
Beyond Isolated Utterances: Cue-Guided Interaction for Context-Dependent Conversational Multimodal Understanding

Zhaoyan Pan, Hengyang Zhou, Xiangdong Li et al.

Conversational multimodal understanding aims to infer the meaning or label of the current utterance from its preceding dialogue context together with textual, acoustic, and visual signals. Existing methods mainly strengthen contextual modeling through enhanced encoding, fusion, or propagation, but rarely abstract the context-utterance dependency into an explicit cue and incorporate it into later multimodal reasoning. To address this issue, we propose CUCI-Net for conversational multimodal understanding. CUCI-Net fully preserves the structural distinction between context and utterance during encoding, effectively abstracts their dependency into an interpretation cue by combining local modality evidence with global contextual evidence, and seamlessly integrates the resulting cue into the final multimodal interaction stage for context-conditioned prediction. Extensive experiments on mainstream benchmark datasets fully demonstrate the effectiveness of the proposed method.