Masahiro Mitsuhara

2 Papers

CV · Oct 29, 2021
ST-ABN: Visual Explanation Taking into Account Spatio-temporal Information for Video Recognition

Masahiro Mitsuhara, Tsubasa Hirakawa, Takayoshi Yamashita et al.

It is difficult for people to interpret the decision-making in the inference process of deep neural networks. Visual explanation is one method for interpreting the decision-making of deep learning. It analyzes the decision-making of 2D CNNs by visualizing an attention map that highlights discriminative regions. Visual explanation for interpreting the decision-making process in video recognition is more difficult because it is necessary to consider not only spatial but also temporal information, which is different from the case of still images. In this paper, we propose a visual explanation method called spatio-temporal attention branch network (ST-ABN) for video recognition. It enables visual explanation for both spatial and temporal information. ST-ABN acquires the importance of spatial and temporal information during network inference and applies it to recognition processing to improve recognition performance and visual explainability. Experimental results with Something-Something datasets V1 & V2 demonstrated that ST-ABN enables visual explanation that takes into account spatial and temporal information simultaneously and improves recognition performance.
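
The idea of applying spatial and temporal importance to the recognition features can be illustrated with a minimal sketch. It is not the paper's implementation: the function name, the single-channel `[T][H][W]` layout, and the residual weighting `(1 + attention)` (borrowed from the attention mechanism commonly described for attention branch networks) are all assumptions made for illustration.

```python
def apply_spatio_temporal_attention(features, spatial_maps, temporal_weights):
    """Weight per-frame features by spatial and temporal importance.

    Hypothetical helper, not the ST-ABN code.
    features:         [T][H][W] nested lists of floats (one channel per frame)
    spatial_maps:     [T][H][W] attention values, assumed to lie in [0, 1]
    temporal_weights: [T] per-frame importance, assumed to lie in [0, 1]
    """
    weighted = []
    for frame, attention, frame_weight in zip(features, spatial_maps, temporal_weights):
        weighted.append([
            # Residual weighting (assumption): a zero attention value leaves
            # the feature unchanged instead of erasing it.
            [value * (1.0 + a) * (1.0 + frame_weight)
             for value, a in zip(feature_row, attention_row)]
            for feature_row, attention_row in zip(frame, attention)
        ])
    return weighted


# Two frames, each with a 1x2 feature map.
features = [[[1.0, 2.0]], [[3.0, 4.0]]]
spatial_maps = [[[0.0, 1.0]], [[0.5, 0.0]]]
temporal_weights = [0.0, 1.0]
weighted = apply_spatio_temporal_attention(features, spatial_maps, temporal_weights)
```

In this sketch the second frame is emphasized as a whole by its temporal weight, while the spatial map selects which locations inside each frame are amplified.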

CV · May 9, 2019
Embedding Human Knowledge into Deep Neural Network via Attention Map

Masahiro Mitsuhara, Hiroshi Fukui, Yusuke Sakashita et al.

In this work, we aim to realize a method for embedding human knowledge into deep neural networks. While conventional methods for embedding human knowledge have been applied to non-deep machine learning, they are challenging to apply to deep learning models due to the enormous number of model parameters. To tackle this problem, we focus on the attention mechanism of an attention branch network (ABN). In this paper, we propose a fine-tuning method that utilizes a single-channel attention map that is manually edited by a human expert. Our fine-tuning method trains a network so that its output attention map corresponds to the edited one. As a result, the fine-tuned network can output an attention map that takes human knowledge into account. Experimental results with ImageNet, CUB-200-2010, and IDRiD demonstrate that it is possible to obtain a clear attention map for visual explanation and to improve classification performance. Our findings offer a novel framework for optimizing networks through intuitive human editing via a visual interface, and suggest new possibilities for human-machine cooperation in addition to improved visual explanations.
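
One way to read the fine-tuning objective described above is as a classification loss plus a penalty on the distance between the network's attention map and the human-edited map. The sketch below assumes an L2 distance and a scalar weight `map_weight`; both, along with the function names, are illustrative assumptions and not details confirmed by the abstract.

```python
import math


def attention_map_loss(predicted_map, edited_map):
    """L2 distance between the output attention map and the edited map.

    Hypothetical helper: both maps are same-shaped 2D nested lists
    (single-channel attention maps).
    """
    total = 0.0
    for predicted_row, edited_row in zip(predicted_map, edited_map):
        for p, e in zip(predicted_row, edited_row):
            total += (p - e) ** 2
    return math.sqrt(total)


def fine_tuning_loss(classification_loss, predicted_map, edited_map, map_weight=1.0):
    """Combine the usual classification loss with the map-matching penalty.

    map_weight is an assumed hyperparameter balancing the two terms.
    """
    return classification_loss + map_weight * attention_map_loss(predicted_map, edited_map)


predicted = [[0.0, 1.0], [1.0, 0.0]]
edited = [[0.0, 1.0], [1.0, 1.0]]  # the expert raised one location
loss = fine_tuning_loss(0.5, predicted, edited, map_weight=2.0)
```

Minimizing the second term pushes the network's attention map toward the edited map, while the first term keeps the classifier accurate.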