CLAug 27, 2024

DualKanbaFormer: An Efficient Selective Sparse Framework for Multimodal Aspect-based Sentiment Analysis

Adamu Lawan, Juhua Pu, Haruna Yunusa, Muhammad Lawan, Aliyu Umar, Adamu Sani Yahya, Mahmoud Basi

arXiv:2408.15379v31.0h-index: 4

Originality Incremental advance

AI Analysis

This work addresses efficiency and accuracy challenges in MABSA for applications like social media analysis, though it appears incremental as it builds on existing sparse attention and multimodal fusion techniques.

The paper tackles the problem of quadratic complexity in attention mechanisms for Multimodal Aspect-based Sentiment Analysis (MABSA) by introducing DualKanbaFormer, which uses sparse attention and state space models to improve efficiency and performance, achieving state-of-the-art results on two public datasets.

Multimodal Aspect-based Sentiment Analysis (MABSA) enhances sentiment detection by integrating textual data with complementary modalities, such as images, to provide a more refined and comprehensive understanding of sentiment. However, conventional attention mechanisms, despite notable benchmarks, are hindered by quadratic complexity, limiting their ability to fully capture global contextual dependencies and rich semantic information in both modalities. To address this limitation, we introduce DualKanbaFormer, a novel framework that leverages parallel Textual and Visual KanbaFormer modules for robust multimodal analysis. Our approach incorporates Aspect-Driven Sparse Attention (ADSA) to dynamically balance coarse-grained aggregation and fine-grained selection for aspect-focused precision, ensuring the preservation of both global context awareness and local precision in textual and visual representations. Additionally, we utilize the Selective State Space Model (Mamba) to capture extensive global semantic information across both modalities. Furthermore, We replace traditional feed-forward networks and normalization with Kolmogorov-Arnold Networks (KANs) and Dynamic Tanh (DyT) to enhance non-linear expressivity and inference stability. To facilitate the effective integration of textual and visual features, we design a multimodal gated fusion layer that dynamically optimizes inter-modality interactions, significantly enhancing the models efficacy in MABSA tasks. Comprehensive experiments on two publicly available datasets reveal that DualKanbaFormer consistently outperforms several state-of-the-art (SOTA) models.

View on arXiv PDF

Similar