CV AI LG Dec 18, 2017

Visual Explanations from Hadamard Product in Multimodal Deep Networks

arXiv:1712.06228v1

Originality: Synthesis-oriented

AI Analysis

This work provides incremental insights into understanding attentional mechanisms in multimodal deep networks, primarily benefiting researchers in explainable AI and visual question answering.

The paper shows that the Hadamard product in multimodal deep networks acts as an attentional mechanism for both visual and textual inputs, extending prior work that showed this only for visual inputs. It does so with a gradient-based visualization technique and compares the result with the learned attentional weights of a visual question answering model.

Visual explanations of a model's learned representations help in understanding the fundamentals of learning. Previous attentional models visualized the attended regions over an image or text using their learned weights to confirm that the mechanism worked as intended. Kim et al. (2016) show that the Hadamard product in multimodal deep networks, which is well known as the joint function for visual question answering tasks, implicitly performs an attentional mechanism for visual inputs. In this work, we extend their result to show that the Hadamard product performs an attentional mechanism not only for visual inputs but also for textual inputs simultaneously, using the proposed gradient-based visualization technique. The attentional effect of the Hadamard product is visualized for both visual and textual inputs by analyzing its two inputs and its output with the proposed method, and the visualization is compared with the learned attentional weights of a visual question answering model.
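
As a rough illustration of why gradients through a Hadamard product can show attention-like behavior, the sketch below computes the gradients of a toy scalar output with respect to each input of an element-wise product. It is not the paper's visualization method: the function names (`hadamard`, `score`, `input_gradients`), the linear output layer, and all numeric values are assumptions made only for this example.

```python
def hadamard(a, b):
    """Element-wise (Hadamard) product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [x * y for x, y in zip(a, b)]


def score(q, v, w):
    """Toy scalar output s = sum_i w_i * q_i * v_i (assumed linear readout)."""
    return sum(wi * fi for wi, fi in zip(w, hadamard(q, v)))


def input_gradients(q, v, w):
    """Analytic gradients of s with respect to each input.

    ds/dq_i = w_i * v_i and ds/dv_i = w_i * q_i, so each modality
    scales the gradient that flows into the other one.
    """
    grad_q = hadamard(w, v)
    grad_v = hadamard(w, q)
    return grad_q, grad_v


# Hypothetical embeddings, chosen only for illustration.
q = [1.0, 0.0, 2.0]  # question (textual) feature
v = [0.5, 3.0, 1.0]  # visual feature
w = [1.0, 1.0, 1.0]  # readout weights

grad_q, grad_v = input_gradients(q, v, w)
```

In this toy setting the second visual channel receives zero gradient because the question feature is zero there, even though its visual activation is the largest. This mutual scaling is the sense in which an element-wise product can weight both inputs at once.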
