AIApr 14

MISID: A Multimodal Multi-turn Dataset for Complex Intent Recognition in Strategic Deception Games

Shufang Lin, Muyang Chen, Xiabing Zhou, Rongrong Zhang, Dayou Zhang, Fangxin Wang

arXiv:2604.1270014.2h-index: 8

Predicted impact top 59% in AI · last 90 daysOriginality Incremental advance

AI Analysis

For researchers in multimodal intent recognition, this work provides a challenging benchmark and a baseline method that exposes and partially addresses limitations of existing MLLMs in strategic, deceptive interactions.

MISID introduces a multimodal, multi-turn dataset for complex intent recognition in strategic deception games, revealing that current MLLMs struggle with text-prior visual hallucination and cross-modal reasoning. The proposed FRACTAM framework improves hidden intent detection and inference accuracy.

Understanding human intent in complex multi-turn interactions remains a fundamental challenge in human-computer interaction and behavioral analysis. While existing intent recognition datasets focus mainly on single utterances or simple dialogues, real-world scenarios often involve sophisticated strategic interactions where participants must maintain complex deceptive narratives over extended periods. To address this gap, we introduce MISID, a comprehensive multimodal, multi-turn, and multi-participant benchmark for intent recognition. Sourced from high-stakes social strategy games, MISID features a fine-grained, two-tier multi-dimensional annotation scheme tailored for long-context discourse analysis and evidence-based causal tracking. Our systematic evaluation of state-of-the-art Multimodal Large Language Models (MLLMs) on MISID reveals critical deficiencies in complex scenarios, including text-prior visual hallucination, impaired cross-modal synergy, and limited capacity in chaining causal cues. Consequently, we propose FRACTAM as a baseline framework. Using a ``Decouple-Anchor-Reason'' paradigm, FRACTAM reduces text bias by extracting pure unimodal factual representations, employs two-stage retrieval for long-range factual anchoring, and constructs explicit cross-modal evidence chains. Extensive experiments demonstrate that FRACTAM enhances mainstream models' performance in complex strategic tasks, improving hidden intent detection and inference while maintaining robust perceptual accuracy. Our dataset is available at https://naislab.cn/datasets/MISID.

View on arXiv PDF

Similar