CV | AI | CL | LG | May 21, 2018

Bilinear Attention Networks

arXiv:1805.07932v2 | 1042 citations
Originality: Highly original
AI Analysis

This paper addresses the problem of efficient multimodal interaction for researchers and practitioners in computer vision and natural language processing, offering a novel method that improves performance over previous approaches.

The paper tackles the high computational cost of attention in multimodal learning by proposing bilinear attention networks (BAN), which use bilinear attention distributions to integrate vision-language information, achieving new state-of-the-art results on VQA 2.0 and Flickr30k Entities datasets.

Attention networks in multimodal learning provide an efficient way to utilize given visual information selectively. However, the computational cost to learn attention distributions for every pair of multimodal input channels is prohibitively expensive. To solve this problem, co-attention builds two separate attention distributions for each modality, neglecting the interaction between multimodal inputs. In this paper, we propose bilinear attention networks (BAN) that find bilinear attention distributions to utilize given vision-language information seamlessly. BAN considers bilinear interactions among two groups of input channels, while low-rank bilinear pooling extracts the joint representations for each pair of channels. Furthermore, we propose a variant of multimodal residual networks to exploit eight attention maps of the BAN efficiently. We quantitatively and qualitatively evaluate our model on visual question answering (VQA 2.0) and Flickr30k Entities datasets, showing that BAN significantly outperforms previous methods and achieves new state-of-the-art results on both datasets.
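The abstract's two ingredients, a single attention distribution over every pair of input channels and a low-rank bilinear pooling of those pairs, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the matrix shapes, the projection matrices `U`, `V`, `U2`, `V2`, and the vector `p` are assumptions made for illustration, and nonlinearities, multiple attention maps, and the residual variant are omitted.

```python
import math
import random


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def bilinear_attention(X, Y, U, V, p):
    """Return a (rho x phi) attention map over all channel pairs.

    X: rho visual channels (rows), Y: phi text channels (rows).
    U, V project both inputs to a shared rank-K space (assumed shapes);
    p weights the K rank-one terms. The softmax runs over ALL pairs,
    so the whole map sums to 1.
    """
    XU = matmul(X, U)  # (rho, K)
    YV = matmul(Y, V)  # (phi, K)
    logits = [[sum(pk * xk * yk for pk, xk, yk in zip(p, xi, yj)) for yj in YV]
              for xi in XU]
    peak = max(max(row) for row in logits)  # subtract max for numerical stability
    weights = [[math.exp(v - peak) for v in row] for row in logits]
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]


def attended_joint_feature(X, Y, U2, V2, A):
    """Low-rank bilinear pooling of every channel pair, weighted by A.

    f[k] = sum_i sum_j A[i][j] * (X U2)[i][k] * (Y V2)[j][k]
    """
    XU = matmul(X, U2)
    YV = matmul(Y, V2)
    rank = len(XU[0])
    return [sum(A[i][j] * XU[i][k] * YV[j][k]
                for i in range(len(XU)) for j in range(len(YV)))
            for k in range(rank)]


def rand_matrix(rows, cols):
    """Random Gaussian matrix, used only as toy input."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
rho, phi, N, M, K = 3, 4, 6, 5, 2  # toy sizes, chosen arbitrarily
X, Y = rand_matrix(rho, N), rand_matrix(phi, M)
U, V = rand_matrix(N, K), rand_matrix(M, K)
U2, V2 = rand_matrix(N, K), rand_matrix(M, K)
p = [random.gauss(0.0, 1.0) for _ in range(K)]

A = bilinear_attention(X, Y, U, V, p)
f = attended_joint_feature(X, Y, U2, V2, A)
print(len(A), len(A[0]), len(f))
```

In contrast to co-attention, which would keep one distribution of length `rho` and another of length `phi`, the map `A` here holds one weight per (visual channel, text channel) pair, while the rank-`K` projections keep the per-pair cost low.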

Code Implementations: 8 repos

Data from Papers with Code (CC-BY-SA-4.0)

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
