Inspecting Explainability of Transformer Models with Additional Statistical Information
This work addresses the need for better explainability in vision Transformers, particularly for variants like Swin Transformer, though it appears incremental as it builds on prior visualization techniques.
The paper tackles the problem of interpreting Transformer models in vision tasks by visualizing attention. It finds that an existing method fails to focus on the predicted object when applied to variants such as Swin Transformer. The proposed method, which incorporates token statistics from layer normalization layers, explains both Swin Transformer and ViT and focuses better on the predicted objects.
Transformers have become increasingly popular in the vision domain in recent years, so there is a need for an effective way to interpret Transformer models by visualizing them. In recent work, Chefer et al. effectively visualize Transformers on vision and multi-modal tasks by combining attention layers to show the importance of each image patch. However, when applied to other Transformer variants such as the Swin Transformer, this method cannot focus on the predicted object. Our method, by considering the statistics of tokens in layer normalization layers, shows a strong ability to explain Swin Transformer and ViT.
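To make the two ingredients mentioned above concrete, the following is a minimal sketch and not the method of this paper or of Chefer et al. It uses attention rollout, a simpler and well-known way of combining attention layers, as a stand-in, and it computes the per-token mean and standard deviation that a layer normalization layer uses internally. The function names `attention_rollout` and `token_statistics` are hypothetical, and how such statistics would enter the relevance map is not specified here.

```python
from statistics import fmean, pstdev


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention_rollout(attentions):
    """Combine per-layer attention maps with attention rollout.

    Assumption: each map is an n x n matrix already averaged over heads,
    with rows summing to 1. Each layer is mixed with the identity to
    account for the residual connection, then layers are multiplied.
    """
    n = len(attentions[0])
    rollout = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn in attentions:
        mixed = [
            [0.5 * attn[i][j] + (0.5 if i == j else 0.0) for j in range(n)]
            for i in range(n)
        ]
        rollout = matmul(mixed, rollout)
    return rollout


def token_statistics(tokens):
    """Per-token (mean, population std) over the feature dimension,
    the quantities layer normalization computes (epsilon omitted)."""
    return [(fmean(t), pstdev(t)) for t in tokens]
```

In this sketch, row i of the rollout matrix can be read as how much each input patch contributes to token i after all layers; for a ViT, the row of the class token would be reshaped into a patch grid to form a heat map.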