ML · AI · CL · LG · Mar 15, 2023

Understanding Post-hoc Explainers: The Case of Anchors

arXiv:2303.08806v1 · 3 citations · h-index: 22
Originality: Synthesis-oriented
AI Analysis

This work addresses the problem of unreliable interpretability methods for users of machine learning models, though it is incremental as it focuses on validating an existing method on simple models.

The paper tackles the lack of theoretical guarantees in post-hoc explainers by analyzing Anchors, a rule-based interpretability method, and demonstrates mathematically that it produces meaningful results for linear text classifiers with TF-IDF vectorization.

In many scenarios, interpreting machine learning models is highly desirable but difficult. To explain the individual predictions of such models, local model-agnostic approaches have been proposed. However, the process generating the explanations can be, for a user, as mysterious as the prediction to be explained. Furthermore, interpretability methods frequently lack theoretical guarantees, and their behavior on simple models is often unknown. While it is difficult, if not impossible, to ensure that an explainer behaves as expected on a cutting-edge model, we can at least ensure that everything works on simple, already interpretable models. In this paper, we present a theoretical analysis of Anchors (Ribeiro et al., 2018): a popular rule-based interpretability method that highlights a small set of words to explain a text classifier's decision. After formalizing its algorithm and providing useful insights, we demonstrate mathematically that Anchors produces meaningful results when used with linear text classifiers on top of a TF-IDF vectorization. We believe that our analysis framework can aid in the development of new explainability methods based on solid theoretical foundations.
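To make the setting in the abstract concrete, here is a minimal sketch of an Anchors-style search on a toy linear classifier over TF-IDF features. It is an illustration under stated assumptions, not the paper's formalization or the reference implementation: the vocabulary, IDF values, coefficients, and intercept are made up; the TF-IDF is an unnormalized count times IDF; perturbation masks each non-anchor word with a placeholder token with probability 0.5; and an exhaustive search stands in for the more efficient search used by the actual method. All function names are hypothetical.

```python
import itertools
import random

# Hypothetical toy vocabulary: made-up IDF values and linear coefficients.
IDF = {"great": 1.5, "movie": 1.0, "boring": 1.5, "plot": 1.2, "the": 0.2}
COEF = {"great": 2.0, "movie": 0.1, "boring": -2.5, "plot": 0.0, "the": 0.0}
INTERCEPT = -0.5
UNK = "UNK"  # placeholder token substituted for masked words


def predict(words):
    """Linear classifier on unnormalized TF-IDF features: 1 if score > 0."""
    score = INTERCEPT
    for w in set(words):
        if w in IDF:  # out-of-vocabulary words (including UNK) contribute 0
            score += COEF[w] * words.count(w) * IDF[w]
    return 1 if score > 0 else 0


def sample_perturbation(words, anchor, rng, p_mask=0.5):
    """Keep anchor positions fixed; mask every other word independently."""
    return [w if i in anchor or rng.random() >= p_mask else UNK
            for i, w in enumerate(words)]


def precision(words, anchor, rng, n_samples=2000):
    """Monte Carlo estimate of P(prediction unchanged | anchor words kept)."""
    target = predict(words)
    hits = sum(predict(sample_perturbation(words, anchor, rng)) == target
               for _ in range(n_samples))
    return hits / n_samples


def shortest_anchor(words, threshold=0.95, seed=0):
    """Exhaustively search for the smallest index set meeting the threshold."""
    rng = random.Random(seed)
    for size in range(len(words) + 1):
        best = None
        for anchor in itertools.combinations(range(len(words)), size):
            p = precision(words, set(anchor), rng)
            if p >= threshold and (best is None or p > best[1]):
                best = (anchor, p)
        if best is not None:
            return [words[i] for i in best[0]], best[1]
    return list(words), 1.0  # only reached if threshold > 1


anchor, prec = shortest_anchor("the movie was great".split())
print(anchor, prec)
```

In this toy example the prediction depends only on whether "great" survives masking, so the search returns that single word as the anchor. This mirrors, on a hand-built case, the kind of behavior the abstract describes for linear classifiers, without reproducing the paper's actual results.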

