CV | CL | Sep 27, 2021

Multimodal Integration of Human-Like Attention in Visual Question Answering

arXiv:2109.13139v1 | 25 citations
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of multimodal attention integration for VQA researchers. The advance is incremental, as it builds on existing transformer-based models and saliency methods.

The paper tackles the problem of integrating human-like attention across both image and text modalities in visual question answering (VQA). It reports a new state-of-the-art accuracy of 73.98% on the VQAv2 test-std split with approximately 80% fewer trainable parameters than prior work.

Human-like attention as a supervisory signal to guide neural attention has shown significant promise but is currently limited to uni-modal integration - even for inherently multimodal tasks such as visual question answering (VQA). We present the Multimodal Human-like Attention Network (MULAN) - the first method for multimodal integration of human-like attention on image and text during training of VQA models. MULAN integrates attention predictions from two state-of-the-art text and image saliency models into neural self-attention layers of a recent transformer-based VQA model. Through evaluations on the challenging VQAv2 dataset, we show that MULAN achieves a new state-of-the-art performance of 73.98% accuracy on test-std and 73.72% on test-dev and, at the same time, has approximately 80% fewer trainable parameters than prior work. Overall, our work underlines the potential of integrating multimodal human-like and neural attention for VQA.
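The abstract does not say how the saliency predictions enter the self-attention layers, so the following is only an illustrative sketch of one plausible scheme, not the authors' implementation. It assumes a per-token saliency weight is added to the attention logits as a log-prior, which is equivalent to multiplying the softmax weights by the saliency and renormalizing. The function name `saliency_biased_attention`, the `eps` smoothing term, and the toy inputs are all hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def saliency_biased_attention(scores, saliency, eps=1e-6):
    """Bias self-attention weights toward salient key tokens.

    scores:   n x n list of raw attention logits (rows = queries, cols = keys).
    saliency: length-n list of non-negative saliency weights, one per key token
              (assumed to come from an external saliency model).
    eps:      hypothetical smoothing term so that log(0) never occurs.

    Adding log(saliency) to the logits before the softmax is the same as
    multiplying the softmax weights by saliency and renormalizing each row.
    """
    log_prior = [math.log(s + eps) for s in saliency]
    return [
        softmax([logit + prior for logit, prior in zip(row, log_prior)])
        for row in scores
    ]


# Toy example: three tokens, the middle one is the most salient.
scores = [[0.0, 0.0, 0.0],
          [1.0, 0.0, -1.0]]
saliency = [0.1, 0.8, 0.1]
weights = saliency_biased_attention(scores, saliency)
```

With uniform logits (the first row), the resulting weights simply follow the saliency distribution. With non-uniform logits (the second row), the saliency acts as a prior that shifts attention mass toward the salient token without fully overriding the learned scores.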

