CVAug 13, 2025

MANGO: Multimodal Attention-based Normalizing Flow Approach to Fusion Learning

Thanh-Dat Truong, Christophe Bobda, Nitin Agarwal, Khoa Luu

arXiv:2508.10133v13 citationsh-index: 16

Originality Incremental advance

AI Analysis

This addresses the challenge of capturing essential features and complex correlations in multimodal data for applications in computer vision and multimedia, though it appears incremental by combining attention with normalizing flows.

The paper tackles the problem of multimodal fusion learning by introducing MANGO, a Multimodal Attention-based Normalizing Flow approach, which achieves state-of-the-art performance on tasks like semantic segmentation, image-to-image translation, and movie genre classification.

Multimodal learning has gained much success in recent years. However, current multimodal fusion methods adopt the attention mechanism of Transformers to implicitly learn the underlying correlation of multimodal features. As a result, the multimodal model cannot capture the essential features of each modality, making it difficult to comprehend complex structures and correlations of multimodal inputs. This paper introduces a novel Multimodal Attention-based Normalizing Flow (MANGO) approach\footnote{The source code of this work will be publicly available.} to developing explicit, interpretable, and tractable multimodal fusion learning. In particular, we propose a new Invertible Cross-Attention (ICA) layer to develop the Normalizing Flow-based Model for multimodal data. To efficiently capture the complex, underlying correlations in multimodal data in our proposed invertible cross-attention layer, we propose three new cross-attention mechanisms: Modality-to-Modality Cross-Attention (MMCA), Inter-Modality Cross-Attention (IMCA), and Learnable Inter-Modality Cross-Attention (LICA). Finally, we introduce a new Multimodal Attention-based Normalizing Flow to enable the scalability of our proposed method to high-dimensional multimodal data. Our experimental results on three different multimodal learning tasks, i.e., semantic segmentation, image-to-image translation, and movie genre classification, have illustrated the state-of-the-art (SoTA) performance of the proposed approach.

View on arXiv PDF

Similar