CV · IV · Dec 13, 2018

Dynamic Fusion with Intra- and Inter- Modality Attention Flow for Visual Question Answering

arXiv:1812.05252v4 · 414 citations
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of effectively integrating visual and language features for visual question answering, representing an incremental improvement in a domain-specific task.

The paper tackles the problem of multi-modality feature fusion in visual question answering by proposing a dynamic fusion method with intra- and inter-modality attention flow, which achieves state-of-the-art performance on the VQA 2.0 dataset.

Learning effective fusion of multi-modality features is at the heart of visual question answering. We propose a novel method of dynamically fusing multi-modal features with intra- and inter-modality information flow, which alternately passes dynamic information within and across the visual and language modalities. It can robustly capture the high-level interactions between the language and vision domains, and thus significantly improves the performance of visual question answering. We also show that the proposed dynamic intra-modality attention flow, conditioned on the other modality, can dynamically modulate the intra-modality attention of the target modality, which is vital for multi-modality feature fusion. Experimental evaluations on the VQA 2.0 dataset show that the proposed method achieves state-of-the-art VQA performance. Extensive ablation studies are carried out for a comprehensive analysis of the proposed method.
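
The two kinds of attention flow described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes plain scaled dot-product attention, omits all learned projections and multi-head structure, and uses a hypothetical gate (a sigmoid over the average-pooled features of the other modality) to stand in for the "conditioned on the other modality" step. All function names and the toy feature sizes are illustrative assumptions.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax_rows(a):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([v / total for v in exps])
    return out


def attend(queries, keys, values):
    """Scaled dot-product attention; returns (attended values, attention weights)."""
    scale = math.sqrt(len(queries[0]))
    scores = [[s / scale for s in row] for row in matmul(queries, transpose(keys))]
    weights = softmax_rows(scores)
    return matmul(weights, values), weights


def add(a, b):
    """Element-wise sum of two equally shaped matrices (residual connection)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def inter_modality_flow(target, source):
    """Each target feature gathers information from the other modality."""
    update, _ = attend(target, source, source)
    return add(target, update)


def dynamic_intra_modality_flow(target, other):
    """Self-attention inside `target`, modulated by a summary of `other`.

    Assumption: the gate is sigmoid(average-pooled `other`), with no learned weights.
    """
    dim = len(target[0])
    pooled = [sum(row[j] for row in other) / len(other) for j in range(dim)]
    gate = [1.0 / (1.0 + math.exp(-p)) for p in pooled]
    # Queries and keys are rescaled channel-wise; values are left unmodulated.
    modulated = [[(1.0 + g) * v for g, v in zip(gate, row)] for row in target]
    update, weights = attend(modulated, modulated, target)
    return add(target, update), weights


# Toy inputs: 4 image-region features and 3 question-word features, 8 dims each.
random.seed(0)
regions = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
words = [[random.gauss(0, 1) for _ in range(8)] for _ in range(3)]

# One round: information flows across modalities, then within each modality.
regions_mid = inter_modality_flow(regions, words)
words_mid = inter_modality_flow(words, regions)
regions_out, region_weights = dynamic_intra_modality_flow(regions_mid, words_mid)
words_out, word_weights = dynamic_intra_modality_flow(words_mid, regions_mid)
```

Repeating the last four lines with the outputs fed back in as inputs gives the alternating pattern the abstract describes. In a trained model the queries, keys, values, and gate would each pass through learned linear layers, which this sketch leaves out.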

