MM AI CL CV LG | Sep 7, 2022

DM$^2$S$^2$: Deep Multi-Modal Sequence Sets with Hierarchical Modality Attention

Shunsuke Kitada, Yuki Iwazaki, Riku Togashi, Hitoshi Iyatomi

arXiv:2209.03126v22.32 citations | h-index: 29

Originality: Incremental advance

AI Analysis

This work addresses challenges in multimodal data fusion for web applications like digital advertising and e-commerce, offering an incremental improvement over existing set-aware models.

The paper addresses a problem that grows with the number of modalities: concatenated mid-fusion features become high-dimensional, and some modalities may be missing. It proposes a set-aware concept, deep multimodal sequence sets (DM$^2$S$^2$), that captures relationships among modalities using a BERT-based encoder and intra- and inter-modality residual attention. The reported performance is comparable to or better than that of previous set-aware models.

There is increasing interest in the use of multimodal data in various web applications, such as digital advertising and e-commerce. Typical methods for extracting important information from multimodal data rely on a mid-fusion architecture that combines the feature representations from multiple encoders. However, as the number of modalities increases, several potential problems with the mid-fusion model structure arise, such as an increase in the dimensionality of the concatenated multimodal features and missing modalities. To address these problems, we propose a new concept that considers multimodal inputs as a set of sequences, namely, deep multimodal sequence sets (DM$^2$S$^2$). Our set-aware concept consists of three components that capture the relationships among multiple modalities: (a) a BERT-based encoder to handle the inter- and intra-order of elements in the sequences, (b) intra-modality residual attention (IntraMRA) to capture the importance of the elements in a modality, and (c) inter-modality residual attention (InterMRA) to enhance the importance of elements with modality-level granularity further. Our concept exhibits performance that is comparable to or better than the previous set-aware models. Furthermore, we demonstrate that the visualization of the learned InterMRA and IntraMRA weights can provide an interpretation of the prediction results.
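The two-level attention described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the BERT-based encoder is replaced by precomputed element vectors, the query vectors stand in for learned parameters, and the residual form used here (attention-weighted sum plus an unweighted mean) is an assumption, since the abstract does not give the formula. The function names `residual_attention_pool` and `encode_sequence_set` are hypothetical. The point of the sketch is that the fused vector keeps a fixed size however many modalities are present, and that a missing modality is simply left out of the set.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def residual_attention_pool(vectors, query):
    """Pool vectors into one vector of the same dimension.

    Assumed residual form: attention-weighted sum + unweighted mean.
    Returns (pooled_vector, attention_weights).
    """
    dim = len(query)
    weights = softmax([dot(v, query) / math.sqrt(dim) for v in vectors])
    n = len(vectors)
    pooled = [
        sum(w * v[i] for w, v in zip(weights, vectors))
        + sum(v[i] for v in vectors) / n
        for i in range(dim)
    ]
    return pooled, weights


def encode_sequence_set(modalities, intra_query, inter_query):
    """Treat the input as a set of sequences, one per modality.

    modalities: dict mapping modality name -> list of element vectors
    (stand-ins for encoder outputs). Empty sequences count as missing.
    """
    present = {name: seq for name, seq in modalities.items() if seq}
    if not present:
        raise ValueError("at least one modality must be present")

    names, pooled, intra_weights = [], [], {}
    for name, seq in present.items():
        # Intra-modality step: weight the elements inside one modality.
        vec, weights = residual_attention_pool(seq, intra_query)
        names.append(name)
        pooled.append(vec)
        intra_weights[name] = weights

    # Inter-modality step: weight the modality-level vectors.
    fused, inter = residual_attention_pool(pooled, inter_query)
    return fused, dict(zip(names, inter)), intra_weights


# Toy input: two-dimensional element vectors, one modality missing.
example = {
    "title": [[1.0, 0.0], [0.0, 1.0]],
    "image": [[0.5, 0.5]],
    "description": [],  # missing modality, skipped rather than zero-padded
}
fused, inter_w, intra_w = encode_sequence_set(example, [1.0, 0.0], [0.0, 1.0])
```

The returned `inter_w` and `intra_w` dictionaries are the kind of weights the abstract refers to when it mentions visualizing attention to interpret predictions; in this sketch they are only as meaningful as the hand-picked query vectors.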
