MM AI SD AS
Dec 12, 2023

More than Vanilla Fusion: a Simple, Decoupling-free, Attention Module for Multimodal Fusion Based on Signal Theory

Peiwen Sun, Yifan Zhang, Zishan Liu, Donghao Chen, Honggang Zhang

arXiv:2312.07212v1 1.2 h-index: 5

Originality: Highly original

AI Analysis

This work addresses multimodal fusion for audio-visual tasks, offering incremental improvements with a novel method for a known bottleneck.

The paper addresses the limitations of vanilla multimodal fusion by proposing a simple, plug-and-play attention module based on signal theory, together with a decoupling-free gradient modulation scheme, and reports performance improvements of up to 2.0% on several multimodal classification methods.

Vanilla fusion methods still dominate a large share of mainstream audio-visual tasks, yet their effectiveness from a theoretical perspective is still open to discussion. This paper therefore reconsiders the fused signal in the multimodal case from a bionics perspective and proposes a simple, plug-and-play attention module for vanilla fusion based on fundamental signal theory and uncertainty theory. In addition, previous work on multimodal dynamic gradient modulation still relies on decoupling the modalities, so a decoupling-free gradient modulation scheme is designed in conjunction with the attention module, with various advantages over the decoupled one. Experimental results show that just a few lines of code can achieve up to 2.0% performance improvements on several multimodal classification methods. Finally, quantitative evaluation on other fusion tasks reveals the potential for additional application scenarios.
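For readers unfamiliar with the term, the sketch below illustrates what "vanilla fusion" commonly means in audio-visual models (concatenating or summing per-modality feature vectors) and where a plug-and-play step could slot in before the fusion. It is not the paper's attention module, which the abstract does not specify. The names `concat_fusion`, `sum_fusion`, `fuse`, and the `pre_fusion` hook are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]
# Hypothetical hook type: takes both feature vectors, returns adjusted ones.
PreFusion = Callable[[Vector, Vector], Tuple[Vector, Vector]]


def concat_fusion(audio: Vector, visual: Vector) -> List[float]:
    """Vanilla fusion by concatenation: [audio ; visual]."""
    return list(audio) + list(visual)


def sum_fusion(audio: Vector, visual: Vector) -> List[float]:
    """Vanilla fusion by element-wise summation (needs equal lengths)."""
    if len(audio) != len(visual):
        raise ValueError("summation needs features of equal length")
    return [a + v for a, v in zip(audio, visual)]


def fuse(
    audio: Vector,
    visual: Vector,
    pre_fusion: Optional[PreFusion] = None,
    method: Callable[[Vector, Vector], List[float]] = concat_fusion,
) -> List[float]:
    """Apply an optional plug-in step, then a vanilla fusion method.

    `pre_fusion` is an assumed extension point, not the paper's design.
    """
    if pre_fusion is not None:
        audio, visual = pre_fusion(audio, visual)
    return method(audio, visual)
```

With `pre_fusion=None`, `fuse` reduces to the plain baseline, which is the sense in which a module of this kind can be described as plug-and-play: the surrounding model code does not change.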
