MM | AI | CV | SP | May 27, 2025

WDMIR: Wavelet-Driven Multimodal Intent Recognition

arXiv:2506.10011v1 | 3 citations | h-index: 6 | IJCAI
Originality: Incremental advance
AI Analysis

This work addresses the challenge of accurately interpreting user intentions in multimodal systems. The advance is incremental: it builds on existing methods and adds a focus on non-verbal information.

The paper tackles the problem of multimodal intent recognition by integrating verbal and non-verbal cues, achieving state-of-the-art performance with a 1.13% accuracy improvement on the MIntRec benchmark.

Multimodal intent recognition (MIR) seeks to accurately interpret user intentions by integrating verbal and non-verbal information across video, audio and text modalities. While existing approaches prioritize text analysis, they often overlook the rich semantic content embedded in non-verbal cues. This paper presents a novel Wavelet-Driven Multimodal Intent Recognition (WDMIR) framework that enhances intent understanding through frequency-domain analysis of non-verbal information. Specifically, we propose: (1) a wavelet-driven fusion module that performs synchronized decomposition and integration of video-audio features in the frequency domain, enabling fine-grained analysis of temporal dynamics; (2) a cross-modal interaction mechanism that facilitates progressive feature enhancement from bimodal to trimodal integration, effectively bridging the semantic gap between verbal and non-verbal information. Extensive experiments on MIntRec demonstrate that our approach achieves state-of-the-art performance, surpassing previous methods by 1.13% in accuracy. Ablation studies further verify that the wavelet-driven fusion module significantly improves the extraction of semantic information from non-verbal sources, with a 0.41% increase in recognition accuracy when analyzing subtle emotional cues.
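
To make the idea of "synchronized decomposition and integration of video-audio features in the frequency domain" more concrete, here is a minimal sketch. It is not the authors' implementation: the one-level Haar transform, the function names (`haar_dwt`, `haar_idwt`, `wavelet_fuse`) and the fixed scalar band weights `w_low` and `w_high` are all assumptions chosen for illustration. It assumes the video and audio features are already aligned to the same shape `(T, D)` with even `T`.

```python
import numpy as np


def haar_dwt(x):
    """One-level Haar wavelet transform along the time axis.

    x has shape (T, D) with even T. Returns the low-frequency
    (approximation) and high-frequency (detail) bands, each (T/2, D).
    """
    even, odd = x[0::2], x[1::2]
    low = (even + odd) / np.sqrt(2.0)
    high = (even - odd) / np.sqrt(2.0)
    return low, high


def haar_idwt(low, high):
    """Inverse of haar_dwt: rebuilds the (T, D) sequence from both bands."""
    even = (low + high) / np.sqrt(2.0)
    odd = (low - high) / np.sqrt(2.0)
    out = np.empty((low.shape[0] * 2, low.shape[1]), dtype=low.dtype)
    out[0::2] = even
    out[1::2] = odd
    return out


def wavelet_fuse(video, audio, w_low=0.5, w_high=0.8):
    """Fuse two aligned feature sequences band by band (illustrative only).

    Both inputs are decomposed with the same transform, so their bands
    cover the same time steps. Each band is mixed with its own weight
    (hypothetical fixed scalars here; a trained model would learn the
    mixing), and the fused bands are transformed back to the time domain.
    """
    v_low, v_high = haar_dwt(video)
    a_low, a_high = haar_dwt(audio)
    f_low = w_low * v_low + (1.0 - w_low) * a_low
    f_high = w_high * v_high + (1.0 - w_high) * a_high
    return haar_idwt(f_low, f_high)


# Usage with random stand-in features: 8 time steps, 4 feature dimensions.
rng = np.random.default_rng(0)
video_feats = rng.normal(size=(8, 4))
audio_feats = rng.normal(size=(8, 4))
fused = wavelet_fuse(video_feats, audio_feats)
```

Because the Haar transform is linear, setting `w_low` equal to `w_high` collapses this sketch to a plain weighted average of the two inputs. The per-band treatment only matters when the bands are mixed differently, which is why the defaults differ.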

