Won-Ju Lee

5.9 · CV · Oct 4, 2023

ViT-ReciproCAM: Gradient and Attention-Free Visual Explanations for Vision Transformer

Seok-Yong Byun, Wonju Lee

This paper presents a novel approach to understanding the prediction process and debugging prediction errors in Vision Transformers (ViT), which have demonstrated superior performance in computer vision tasks such as image classification and object detection. While visual explainability techniques such as CAM, Grad-CAM, Score-CAM, and Recipro-CAM have been extensively researched for Convolutional Neural Networks (CNNs), limited research has been conducted on ViT. Current state-of-the-art solutions for ViT rely on class-agnostic Attention-Rollout and Relevance techniques. In this work, we propose a new gradient-free visual explanation method for ViT, called ViT-ReciproCAM, which requires neither the attention matrix nor gradient information. ViT-ReciproCAM uses token masking and generates new layer outputs from the target layer's input to exploit the correlation between activated tokens and network predictions for target classes. Our proposed method outperforms the state-of-the-art Relevance method in the Average Drop-Coherence-Complexity (ADCC) metric by $4.58\%$ to $5.80\%$ and generates more localized saliency maps. Our experiments demonstrate the effectiveness of ViT-ReciproCAM and showcase its potential for understanding and debugging ViT models. The method provides an efficient and easy-to-implement alternative for generating visual explanations without attention or gradient information, which can benefit a range of computer vision applications.
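
The token-masking idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: `toy_rest_of_network` is a hypothetical stand-in for the layers that follow the target layer, and the function names, the zero-masking choice, and the toy shapes are all assumptions made for illustration. Each patch token is kept in turn (together with the class token), the masked sequence is pushed through the stand-in network, and the target-class logit becomes that patch's score.

```python
def toy_rest_of_network(tokens, class_weights):
    """Hypothetical stand-in for the layers after the target layer.

    Adds the mean patch token to the class token, applies a linear
    classifier, and returns one logit per class.
    """
    cls_token, patch_tokens = tokens[0], tokens[1:]
    dim = len(cls_token)
    mean_patch = [sum(p[d] for p in patch_tokens) / len(patch_tokens) for d in range(dim)]
    mixed = [cls_token[d] + mean_patch[d] for d in range(dim)]
    return [sum(w * x for w, x in zip(row, mixed)) for row in class_weights]


def token_mask_scores(tokens, class_weights, target_class, grid_width):
    """Score each patch by keeping only it (plus the class token) and
    reading the target-class logit. Returns the scores as a patch grid."""
    cls_token, patch_tokens = tokens[0], tokens[1:]
    zero = [0.0] * len(cls_token)
    scores = []
    for keep in range(len(patch_tokens)):
        # Assumption: masked-out tokens are replaced by zero vectors.
        masked = [cls_token] + [p if k == keep else zero for k, p in enumerate(patch_tokens)]
        scores.append(toy_rest_of_network(masked, class_weights)[target_class])
    return [scores[r:r + grid_width] for r in range(0, len(scores), grid_width)]


# Toy input: one class token followed by four 2-dimensional patch tokens (a 2x2 grid).
tokens = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [0.0, 0.0], [1.0, 1.0]]
class_weights = [[1.0, 0.0], [0.0, 1.0]]  # two classes, one weight row each
grid = token_mask_scores(tokens, class_weights, target_class=0, grid_width=2)
print(grid)
```

In this toy setup the patch whose features align with the class-0 weights receives the highest score. A real model would replace the stand-in with the actual remaining transformer blocks and would typically evaluate all masked sequences as one batch.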

4.8 · CV · Sep 28, 2022 · Code

Recipro-CAM: Fast gradient-free visual explanations for convolutional neural networks

Seok-Yong Byun, Wonju Lee

The Convolutional Neural Network (CNN) is a widely used deep learning architecture for computer vision. However, its black-box nature makes the model's behavior difficult to interpret. To mitigate this issue, AI practitioners have explored explainable AI (XAI) methods such as Class Activation Map (CAM) and Grad-CAM. Although these methods have shown promise, they are limited by architectural constraints or the burden of gradient computation. Score-CAM and Ablation-CAM were proposed as gradient-free alternatives; they resolve the gradient-related issues and enable XAI in inference mode, but their execution times are longer than those of CAM- or Grad-CAM-based methods, which makes them unsuitable for real-world solutions. To address this challenge, we propose a fast gradient-free Reciprocal CAM (Recipro-CAM) method. Our approach spatially masks the extracted feature maps to exploit the correlation between activation maps and network predictions for target classes. The proposed method outperforms the current state-of-the-art method in the Average Drop-Coherence-Complexity (ADCC) metric by $1.78\%$ to $3.72\%$, excluding the VGG-16 backbone. Moreover, Recipro-CAM generates saliency maps at a similar rate to Grad-CAM and is approximately $148$ times faster than Score-CAM. The source code for Recipro-CAM is available in our data analysis framework.
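
As a rough illustration of spatially masking feature maps, the sketch below keeps one spatial location of a channels x height x width feature map at a time and records the target-class probability from a classifier head. It is a minimal sketch under stated assumptions, not the released Recipro-CAM code: `toy_head` (global average pooling, a linear layer, and softmax) is a hypothetical stand-in for the network after the feature extractor, and the single-pixel mask and min-max normalization are illustrative choices.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_head(feature_map, weights):
    """Hypothetical head: global average pooling, linear layer, softmax."""
    pooled = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    logits = [sum(w * p for w, p in zip(class_w, pooled)) for class_w in weights]
    return softmax(logits)


def spatial_mask_saliency(feature_map, weights, target_class):
    """Keep one spatial location at a time, score the target class,
    and min-max normalize the resulting height x width map."""
    height, width = len(feature_map[0]), len(feature_map[0][0])
    saliency = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            # Assumption: every location except (i, j) is zeroed in all channels.
            masked = [
                [[ch[r][c] if (r, c) == (i, j) else 0.0 for c in range(width)]
                 for r in range(height)]
                for ch in feature_map
            ]
            saliency[i][j] = toy_head(masked, weights)[target_class]
    lo = min(min(row) for row in saliency)
    hi = max(max(row) for row in saliency)
    span = hi - lo
    return [[(v - lo) / span if span > 0 else 0.0 for v in row] for row in saliency]


# Toy input: two channels on a 2x2 grid, each active at a different location.
feature_map = [
    [[0.0, 4.0], [0.0, 0.0]],  # channel 0 fires at (0, 1)
    [[0.0, 0.0], [4.0, 0.0]],  # channel 1 fires at (1, 0)
]
weights = [[1.0, 0.0], [0.0, 1.0]]  # class k reads channel k
saliency_map = spatial_mask_saliency(feature_map, weights, target_class=0)
print(saliency_map)
```

Because the masked feature maps are independent of one another, they can be stacked into a single batch and evaluated in one forward pass without any gradient computation, which is consistent with the speed comparison the abstract reports.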
