CL, AI · Jun 24, 2022

Robustness of Explanation Methods for NLP Models

arXiv:2206.12284v1 · 4 citations · h-index: 12
Originality: Incremental advance
AI Analysis

This paper addresses the vulnerability of explanation methods in NLP, which matters for trust and interpretability in AI applications. It is rated an incremental advance because its focus is specific to the text modality.

The paper tackles the problem of unreliable explanation methods for NLP models by evaluating their adversarial robustness, showing that small input changes can disturb explanations for up to 86% of tested samples.

Explanation methods have emerged as an important tool to highlight the features responsible for the predictions of neural networks. There is mounting evidence that many explanation methods are rather unreliable and susceptible to malicious manipulations. In this paper, we particularly aim to understand the robustness of explanation methods in the context of text modality. We provide initial insights and results towards devising a successful adversarial attack against text explanations. To our knowledge, this is the first attempt to evaluate the adversarial robustness of an explanation method. Our experiments show the explanation method can be largely disturbed for up to 86% of the tested samples with small changes in the input sentence and its semantics.
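
One way to make "disturbed for up to 86% of the tested samples" concrete is to compare the most important tokens before and after a small input change. The sketch below is an illustration only, not the paper's metric: the function names, the top-k overlap measure, the value of k, and the 0.5 threshold are all assumptions chosen for the example. It also assumes each explanation is given as a mapping from token to attribution score, with each token appearing once.

```python
from typing import Dict, Sequence, Set, Tuple

Attribution = Dict[str, float]  # token -> attribution score (assumed format)


def top_k_tokens(attributions: Attribution, k: int) -> Set[str]:
    """Return the k tokens with the largest absolute attribution."""
    ranked = sorted(attributions, key=lambda tok: abs(attributions[tok]), reverse=True)
    return set(ranked[:k])


def top_k_overlap(original: Attribution, perturbed: Attribution, k: int = 3) -> float:
    """Fraction of top-k tokens shared by two explanations (1.0 = identical sets)."""
    top_original = top_k_tokens(original, k)
    top_perturbed = top_k_tokens(perturbed, k)
    # Guard against inputs with fewer than k tokens (or none at all).
    denom = max(1, min(k, len(top_original), len(top_perturbed)))
    return len(top_original & top_perturbed) / denom


def disturbed_fraction(
    pairs: Sequence[Tuple[Attribution, Attribution]],
    k: int = 3,
    threshold: float = 0.5,  # hypothetical cut-off for "disturbed"
) -> float:
    """Share of (original, perturbed) pairs whose top-k overlap is below threshold."""
    if not pairs:
        return 0.0
    disturbed = sum(1 for orig, pert in pairs if top_k_overlap(orig, pert, k) < threshold)
    return disturbed / len(pairs)


if __name__ == "__main__":
    # Made-up attribution scores, for illustration only.
    original = {"the": 0.1, "movie": 0.2, "was": 0.05, "great": 0.9, "fun": 0.6}
    perturbed = {"the": 0.7, "film": 0.8, "was": 0.6, "great": 0.1, "fun": 0.05}
    print(top_k_overlap(original, perturbed))
    print(disturbed_fraction([(original, perturbed), (original, original)]))
```

A real evaluation would also need to check that the model's prediction is unchanged by the perturbation and that the edit to the sentence is small, both of which this sketch leaves out.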

