CV · AI · CL · Apr 30, 2023

Multimodal Graph Transformer for Multimodal Question Answering

arXiv:2305.00581v1272 citations · h-index: 43
Originality: Incremental advance
AI Analysis

This work addresses multimodal reasoning for AI systems, offering a hybrid approach that combines Transformers and graph neural networks, though it is incremental in nature.

The paper tackles the problem of multimodal question answering by proposing a Multimodal Graph Transformer that integrates graph information into self-attention to improve reasoning across modalities, achieving significant performance gains over Transformer baselines on datasets like GQA, VQAv2, and MultiModalQA.

Despite the success of Transformer models in vision and language tasks, they often learn knowledge from enormous data implicitly and cannot utilize structured input data directly. On the other hand, structured learning approaches such as graph neural networks (GNNs) that integrate prior information can barely compete with Transformer models. In this work, we aim to benefit from both worlds and propose a novel Multimodal Graph Transformer for question answering tasks that require reasoning across multiple modalities. We introduce a graph-involved plug-and-play quasi-attention mechanism that incorporates multimodal graph information, acquired from text and visual data, into the vanilla self-attention as an effective prior. In particular, we construct the text graph, dense region graph, and semantic graph to generate adjacency matrices, and then compose them with input vision and language features to perform downstream reasoning. Regularizing self-attention with graph information in this way significantly improves inference ability and helps align features from different modalities. We validate the effectiveness of Multimodal Graph Transformer over its Transformer baselines on GQA, VQAv2, and MultiModalQA datasets.
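The abstract does not spell out how the adjacency matrices enter self-attention, so the following is only a minimal sketch of one common way to use a graph as an attention prior, not the paper's actual formulation. It assumes the adjacency matrix is converted into an additive bias on the attention logits (0 for connected token pairs, a large negative `penalty` for unconnected ones). The function name `graph_biased_attention`, the `penalty` parameter, and the toy graph are all hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def graph_biased_attention(queries, keys, values, adjacency, penalty=-1e9):
    """Scaled dot-product attention with an additive graph prior.

    Assumption (not taken from the paper): the adjacency matrix becomes a
    bias of 0 for connected token pairs and `penalty` for unconnected pairs,
    and that bias is added to the attention logits before the softmax.
    """
    d = len(queries[0])
    outputs, weights = [], []
    for i, q in enumerate(queries):
        logits = []
        for j, k in enumerate(keys):
            score = sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
            bias = 0.0 if adjacency[i][j] else penalty
            logits.append(score + bias)
        w = softmax(logits)
        weights.append(w)
        # Weighted sum of value vectors, one output channel at a time.
        outputs.append([
            sum(w[j] * values[j][c] for j in range(len(values)))
            for c in range(len(values[0]))
        ])
    return outputs, weights


# Toy input: three tokens with 2-dimensional features (illustrative values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical graph: tokens 0 and 1 are linked, token 2 is isolated.
# Self-loops keep every softmax row well defined.
adjacency = [[1, 1, 0],
             [1, 1, 0],
             [0, 0, 1]]
outputs, weights = graph_biased_attention(x, x, x, adjacency)
```

In this toy setup, token 2 attends only to itself and tokens 0 and 1 never attend to token 2, which illustrates how a graph can restrict or reweight attention. A softer prior (for example, a small finite bias instead of a hard mask) would fit the same interface by changing `penalty`.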

