CL · LG · Aug 13, 2023

Robust Infidelity: When Faithfulness Measures on Masked Language Models Are Misleading

Evan Crothers, Herna Viktor, Nathalie Japkowicz

arXiv:2308.06795v2 · 0.91 citations · h-index: 44

Originality: Incremental advance

AI Analysis

This work matters to NLP researchers and practitioners because it exposes pitfalls in widely used interpretability metrics. Its contribution is incremental, however, because it builds on existing methods.

The paper examines misleading faithfulness measures for neural text classifiers. It shows that iterative masking yields large variation in faithfulness scores across otherwise comparable models, and that masked inputs produce out-of-distribution embeddings that make model behaviour unpredictable. Together, these effects undermine principled comparisons of interpretability.

A common approach to quantifying neural text classifier interpretability is to calculate faithfulness metrics based on iteratively masking salient input tokens and measuring changes in the model prediction. We propose that this property is better described as "sensitivity to iterative masking", and highlight pitfalls in using this measure for comparing text classifier interpretability. We show that iterative masking produces large variation in faithfulness scores between otherwise comparable Transformer encoder text classifiers. We then demonstrate that iteratively masked samples produce embeddings outside the distribution seen during training, resulting in unpredictable behaviour. We further explore task-specific considerations that undermine principled comparison of interpretability using iterative masking, such as an underlying similarity to salience-based adversarial attacks. Our findings give insight into how these behaviours affect neural text classifiers, and provide guidance on how sensitivity to iterative masking should be interpreted.
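The iterative-masking procedure described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes an AOPC-style score, meaning the mean drop in predicted-class probability as the top-k salient tokens are masked. It also relies on several hypothetical pieces: a toy bag-of-words classifier, a `[MASK]` placeholder token, and salience taken as the absolute token weight. The names `make_toy_classifier`, `iterative_masking_scores`, and `aopc` are illustrative. The toy model maps the mask token to zero contribution, so it cannot reproduce the out-of-distribution embedding effect the paper reports for Transformer encoders. A real masked language model embeds `[MASK]` as a learned vector, and that is where the unpredictability enters.

```python
import math
from typing import Callable, Dict, List, Sequence

MASK = "[MASK]"  # hypothetical placeholder; a real MLM tokenizer defines its own mask token


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def make_toy_classifier(
    weights: Dict[str, float], bias: float = 0.0
) -> Callable[[Sequence[str]], float]:
    """Hypothetical bag-of-words classifier returning P(positive class)."""

    def predict(tokens: Sequence[str]) -> float:
        # Unknown tokens (including MASK) contribute 0 in this toy model.
        return sigmoid(bias + sum(weights.get(t, 0.0) for t in tokens))

    return predict


def iterative_masking_scores(
    tokens: Sequence[str],
    predict: Callable[[Sequence[str]], float],
    salience: Sequence[float],
    steps: int,
) -> List[float]:
    """Mask the `steps` most salient tokens one at a time (cumulatively) and
    record the drop in probability of the originally predicted class."""
    p_orig = predict(tokens)
    positive = p_orig >= 0.5

    def class_prob(p: float) -> float:
        return p if positive else 1.0 - p

    # Rank token positions by salience, highest first (ties keep input order).
    order = sorted(range(len(tokens)), key=lambda i: salience[i], reverse=True)
    masked = list(tokens)
    drops: List[float] = []
    for i in order[:steps]:
        masked[i] = MASK
        drops.append(class_prob(p_orig) - class_prob(predict(masked)))
    return drops


def aopc(drops: Sequence[float]) -> float:
    """Area over the perturbation curve: mean probability drop over all steps."""
    return sum(drops) / len(drops) if drops else 0.0


if __name__ == "__main__":
    weights = {"great": 2.0, "fun": 1.0, "boring": -1.5}
    predict = make_toy_classifier(weights)
    tokens = ["a", "great", "and", "fun", "film"]
    salience = [abs(weights.get(t, 0.0)) for t in tokens]
    drops = iterative_masking_scores(tokens, predict, salience, steps=2)
    print([round(d, 4) for d in drops], round(aopc(drops), 4))
```

A larger AOPC means the prediction is more sensitive to masking its most salient tokens. The paper's point is that this quantity, which it calls "sensitivity to iterative masking", can vary widely between comparable classifiers and should not be read directly as interpretability.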

