CV, Oct 13, 2025

Evaluating the Explainability of Vision Transformers in Medical Imaging

arXiv:2510.12021v1, 4 citations, h-index: 43
Originality: Synthesis-oriented
AI Analysis

This research addresses the need for interpretable AI in medical imaging to support clinical trust and adoption, though it represents an incremental evaluation of existing methods on medical data.

This study evaluated the explainability of Vision Transformers in medical imaging by comparing different architectures and pre-training strategies using Gradient Attention Rollout and Grad-CAM on blood cell and breast ultrasound classification tasks. The results showed DINO with Grad-CAM provided the most faithful and localized explanations, highlighting clinically relevant features even in misclassifications.

Understanding model decisions is crucial in medical imaging, where interpretability directly impacts clinical trust and adoption. Vision Transformers (ViTs) have demonstrated state-of-the-art performance in diagnostic imaging; however, their complex attention mechanisms pose challenges to explainability. This study evaluates the explainability of different Vision Transformer architectures and pre-training strategies - ViT, DeiT, DINO, and Swin Transformer - using Gradient Attention Rollout and Grad-CAM. We conduct both quantitative and qualitative analyses on two medical imaging tasks: peripheral blood cell classification and breast ultrasound image classification. Our findings indicate that DINO combined with Grad-CAM offers the most faithful and localized explanations across datasets. Grad-CAM consistently produces class-discriminative and spatially precise heatmaps, while Gradient Attention Rollout yields more scattered activations. Even in misclassification cases, DINO with Grad-CAM highlights clinically relevant morphological features that appear to have misled the model. By improving model transparency, this research supports the reliable and explainable integration of ViTs into critical medical diagnostic workflows.
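
The two attribution methods named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. `grad_cam` follows the standard Grad-CAM recipe (gradient-averaged channel weights, weighted sum, ReLU). `attention_rollout` follows the usual rollout recursion, and its optional `grads` argument is an assumed form of gradient weighting, since the abstract does not specify the exact variant. All function names and array shapes here are hypothetical, and for a ViT the activations are assumed to have already been reshaped from patch tokens into a `(channels, H, W)` grid.

```python
import numpy as np


def grad_cam(activations, gradients):
    """Grad-CAM heatmap from one layer.

    activations, gradients: arrays of shape (channels, H, W), where
    gradients holds d(class score)/d(activations).
    """
    # Channel weights: spatial average of the gradients.
    weights = gradients.mean(axis=(1, 2))
    # Weighted sum over channels, then ReLU to keep positive evidence only.
    cam = np.maximum(np.tensordot(weights, activations, axes=1), 0.0)
    peak = cam.max()
    return cam / peak if peak > 0 else cam


def attention_rollout(attentions, grads=None):
    """Rollout of attention from the CLS token to the patch tokens.

    attentions: list of arrays (heads, tokens, tokens), one per layer,
                with token 0 assumed to be the CLS token.
    grads:      optional list of same-shaped gradient arrays; the
                weighting below is an assumption, not the paper's recipe.
    """
    n_tokens = attentions[0].shape[-1]
    rollout = np.eye(n_tokens)
    for layer, attn in enumerate(attentions):
        if grads is not None:
            # Assumed gradient weighting: keep positively contributing links.
            attn = np.maximum(attn * grads[layer], 0.0)
        fused = attn.mean(axis=0)              # average over heads
        fused = fused + np.eye(n_tokens)       # model the residual connection
        fused = fused / fused.sum(axis=-1, keepdims=True)  # renormalize rows
        rollout = fused @ rollout              # compose with earlier layers
    return rollout[0, 1:]                      # CLS row, patch tokens only


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    layers = []
    for _ in range(4):
        a = rng.random((3, 5, 5))              # 3 heads, CLS + 4 patches
        layers.append(a / a.sum(axis=-1, keepdims=True))
    print(attention_rollout(layers).reshape(2, 2))
    print(grad_cam(rng.random((8, 2, 2)), rng.standard_normal((8, 2, 2))))
```

Grad-CAM is tied to a single class score through its gradients, which is consistent with the abstract's description of it as class-discriminative. Plain rollout only composes attention maps across layers and carries no class signal unless gradients are mixed in.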
