MM · Apr 28

Beyond Isolated Utterances: Cue-Guided Interaction for Context-Dependent Conversational Multimodal Understanding

Zhaoyan Pan, Hengyang Zhou, Xiangdong Li, Yuning Wang, Ye Lou, Jiatong Pan, Ji Zhou, Wei Zhang

arXiv:2604.25618 · 50.8

Predicted impact: top 57% in MM · last 90 days · Originality: Highly original

AI Analysis

For researchers in multimodal dialogue systems, this work introduces a novel cue-guided interaction paradigm that improves context-dependent understanding over existing methods.

The paper tackles conversational multimodal understanding by proposing CUCI-Net, which abstracts the context-utterance dependency into an explicit cue and integrates that cue into multimodal reasoning. The authors report that experiments on mainstream benchmark datasets demonstrate the method's effectiveness.

Conversational multimodal understanding aims to infer the meaning or label of the current utterance from its preceding dialogue context together with textual, acoustic, and visual signals. Existing methods mainly strengthen contextual modeling through enhanced encoding, fusion, or propagation, but rarely abstract the context-utterance dependency into an explicit cue and incorporate it into later multimodal reasoning. To address this issue, we propose CUCI-Net for conversational multimodal understanding. CUCI-Net preserves the structural distinction between context and utterance during encoding, abstracts their dependency into an interpretation cue by combining local modality evidence with global contextual evidence, and integrates the resulting cue into the final multimodal interaction stage for context-conditioned prediction. Experiments on mainstream benchmark datasets demonstrate the effectiveness of the proposed method.
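The three stages named in the abstract (separate encoding, cue abstraction, cue-guided interaction) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify any operators, so the dot-product attention, the tanh combination, the softmax gating, and every function name below (`build_cue`, `cue_guided_fusion`) are assumptions chosen only to make the data flow concrete. There are no learned parameters, and the feature vectors are made-up numbers.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]

def mean(vectors):
    return weighted_sum([1.0 / len(vectors)] * len(vectors), vectors)

def build_cue(context, utterance):
    """Hypothetical cue abstraction.

    context:   {modality: [vector per preceding utterance]}
    utterance: {modality: vector for the current utterance}
    Context and utterance stay in separate structures throughout.
    """
    # Local modality evidence (assumed form): within each modality,
    # the current utterance attends over its own context.
    local = []
    for modality, u in utterance.items():
        attn = softmax([dot(u, c) for c in context[modality]])
        local.append(weighted_sum(attn, context[modality]))
    local_evidence = mean(local)
    # Global contextual evidence (assumed form): mean over all context vectors.
    global_evidence = mean([c for vecs in context.values() for c in vecs])
    # Combine both into a single cue vector (assumed: sum followed by tanh).
    return [math.tanh(l + g) for l, g in zip(local_evidence, global_evidence)]

def cue_guided_fusion(cue, utterance):
    """Hypothetical final interaction: the cue weights each modality."""
    names = list(utterance)
    weights = softmax([dot(cue, utterance[m]) for m in names])
    fused = weighted_sum(weights, [utterance[m] for m in names])
    return fused, dict(zip(names, weights))

# Toy 2-dimensional features for two preceding utterances per modality.
context = {
    "text":   [[1.0, 0.0], [0.0, 1.0]],
    "audio":  [[0.5, 0.5], [0.2, 0.8]],
    "visual": [[0.0, 0.0], [1.0, 1.0]],
}
utterance = {"text": [1.0, 0.0], "audio": [0.0, 1.0], "visual": [0.5, 0.5]}

cue = build_cue(context, utterance)
fused, weights = cue_guided_fusion(cue, utterance)
```

In this sketch the fused vector would feed a classifier for the utterance label. A real model would replace the fixed operators with trained encoders and projection layers.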
