AI · May 12

Transformer Interpretability from Perspective of Attention and Gradient

arXiv:2605.11392 · 56.6

Predicted impact: top 66% in AI · last 90 days · Originality: Synthesis-oriented

AI Analysis

For researchers and practitioners using Vision Transformers, this work offers a new interpretation method and highlights a potential security vulnerability, though the novelty is incremental.

The paper proposes a method for interpreting Transformer models by guiding the gradient direction (more precisely, the attention direction), which yields more comprehensive interpretation of feature regions as well as detail-level interpretation. It also demonstrates a class rewriting phenomenon in Vision Transformers, in which an image's class is altered by changes almost imperceptible to the human eye, and which could pose security risks.

Although research attention centers on the performance of Transformer models, their interpretation cannot be ignored. Gradients are widely used in Transformer interpretation. From the perspective of attention and gradient, we conduct an in-depth study of Transformer interpretation and propose a method that achieves it by guiding the gradient direction or, more precisely, the attention direction. The method enables more comprehensive interpretation of feature regions, offers detail-level interpretation, and helps to better understand the Transformer mechanism. Leveraging the difference in how the Vision Transformer (ViT) and humans perceive images, we alter the class of an image in a way that is almost imperceptible to the human eye. This class rewriting phenomenon may pose security risks in certain scenarios.
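For readers unfamiliar with how gradients and attention are commonly combined, here is a minimal sketch of one generic rule: weight each attention entry by the gradient of the class score with respect to it, keep the positive part, and average over heads. This is not the paper's method, whose details are not given here. The function names, the toy attention values, and the toy gradients are all hypothetical; in practice the gradients would come from an autograd framework.

```python
# Generic gradient-weighted attention sketch (NOT the paper's method).
# All names and numbers below are illustrative assumptions.

def grad_weighted_attention(attn, grad):
    """Average over heads of max(0, attention * gradient).

    attn, grad: nested lists shaped [heads][tokens][tokens].
    Returns a [tokens][tokens] relevance matrix.
    """
    heads = len(attn)
    n = len(attn[0])
    rel = [[0.0] * n for _ in range(n)]
    for h in range(heads):
        for i in range(n):
            for j in range(n):
                # Keep only entries whose attention pushes the class score up.
                rel[i][j] += max(0.0, attn[h][i][j] * grad[h][i][j]) / heads
    return rel


def cls_relevance(rel):
    """Normalized relevance of each patch token as seen from the CLS token (row 0)."""
    row = rel[0][1:]  # skip the CLS-to-CLS entry
    total = sum(row)
    return [r / total for r in row] if total > 0 else row


# Toy input: 2 heads, 3 tokens (CLS + 2 patches); each attention row sums to 1.
attn = [
    [[0.2, 0.5, 0.3], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]],
    [[0.6, 0.1, 0.3], [0.2, 0.2, 0.6], [0.5, 0.25, 0.25]],
]
# Hypothetical gradients of the class score with respect to each attention entry.
grad = [
    [[0.0, 2.0, -1.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    [[0.0, 1.0, 1.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
]

rel = grad_weighted_attention(attn, grad)
scores = cls_relevance(rel)
print(scores)  # per-patch relevance; the first patch dominates in this toy case
```

In this toy case the first patch scores higher because both heads attend to it with a positive gradient, while the second patch's negative gradient in head 0 is clipped to zero.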
