CV | AI | Nov 19, 2023

Inspecting Explainability of Transformer Models with Additional Statistical Information

arXiv:2311.11378v2 | 5 citations | h-index: 7
Originality: Incremental advance
AI Analysis

This work addresses the need for better explainability in vision Transformers, particularly for variants like Swin Transformer, though it appears incremental as it builds on prior visualization techniques.

The paper tackles the problem of interpreting vision Transformers through attention visualization and finds that existing methods fail on variants such as the Swin Transformer. The proposed method, which incorporates token statistics from layer normalization layers, explains both Swin Transformer and ViT and focuses more closely on the predicted object.

Transformers have become more popular in the vision domain in recent years, so there is a need for an effective way to interpret Transformer models by visualizing them. In recent work, Chefer et al. visualize Transformers on vision and multi-modal tasks effectively by combining attention layers to show the importance of each image patch. However, when applied to other Transformer variants such as the Swin Transformer, this method cannot focus on the predicted object. Our method, by considering the statistics of tokens in layer normalization layers, shows a strong ability to explain Swin Transformer and ViT.
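To make the idea of "combining attention layers to show the importance of each image patch" concrete, here is a minimal sketch of attention rollout, a simpler and older technique than the one by Chefer et al. It is not the method of this paper and it does not use layer normalization statistics. The function names, the input layout (one `(heads, tokens, tokens)` array per layer, with a class token at index 0), and the equal weighting of the residual connection are all assumptions made for illustration.

```python
import numpy as np


def attention_rollout(attentions):
    """Combine per-layer attention maps into one token-to-token relevance map.

    attentions: list of arrays shaped (heads, tokens, tokens), one per layer,
    ordered from the first layer to the last. Each row is assumed to be a
    softmax output, i.e. non-negative and summing to 1.
    """
    num_tokens = attentions[0].shape[-1]
    identity = np.eye(num_tokens)
    rollout = identity.copy()
    for layer_attention in attentions:
        # Average over heads (an assumption; other fusions are possible).
        fused = layer_attention.mean(axis=0)
        # Model the residual connection by mixing in the identity matrix.
        fused = 0.5 * fused + 0.5 * identity
        # Re-normalize so every row is still a distribution over tokens.
        fused = fused / fused.sum(axis=-1, keepdims=True)
        # Propagate relevance from earlier layers through this layer.
        rollout = fused @ rollout
    return rollout


def patch_relevance(attentions):
    """Relevance of each image patch for the class token (assumed at index 0)."""
    rollout = attention_rollout(attentions)
    return rollout[0, 1:]


if __name__ == "__main__":
    # Synthetic attention maps: 3 layers, 2 heads, 1 class token + 4 patches.
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(3, 2, 5, 5))
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    print(patch_relevance([weights[i] for i in range(3)]))
```

With a real ViT, the relevance vector would be reshaped to the patch grid and upsampled to the image size to obtain a heatmap. A Swin Transformer has no class token and restricts attention to shifted local windows, so this sketch does not carry over to it directly, which fits the abstract's point that attention-combination methods struggle on such variants.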
