Comparing Feature Importance and Rule Extraction for Interpretability on Text Data
The work addresses the reliability of interpretability methods for users in critical text-based tasks, but its contribution is incremental: it compares existing methods rather than introducing a new interpretability method.
The paper tackles the problem of inconsistent explanations from different interpretability methods on text data, showing that even simple models can yield unexpectedly different results, and proposes a new approach to quantify these differences.
Complex machine learning algorithms are increasingly used in critical tasks involving text data, which has led to the development of interpretability methods. Among local methods, two families have emerged: those computing importance scores for each feature and those extracting simple logical rules. In this paper, we show that different methods can produce unexpectedly different explanations, even when applied to simple models for which we would expect the explanations to agree qualitatively. To quantify this effect, we propose a new approach to compare explanations produced by different methods.
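
To make the idea of comparing the two families concrete, here is a minimal sketch of one possible comparison. It is not the approach proposed in the paper, which the abstract does not specify. It assumes a feature-importance explanation is given as a word-to-score mapping and a rule-based explanation as the set of words appearing in the rule, and it measures their agreement with a Jaccard overlap. The function names (top_k_words, jaccard, compare_explanations) and all numeric values are hypothetical and chosen only for illustration.

```python
def top_k_words(importance, k):
    """Return the k words with the largest absolute importance score."""
    ranked = sorted(importance, key=lambda w: abs(importance[w]), reverse=True)
    return set(ranked[:k])


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 1.0 when both are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def compare_explanations(importance, rule_words):
    """Compare a feature-importance explanation with a rule-based one.

    importance: dict mapping word -> score (hypothetical output of a
                feature-importance method).
    rule_words: set of words whose presence forms the extracted rule
                (hypothetical output of a rule-extraction method).
    The top-k set uses k = len(rule_words) so both sets have the same size.
    """
    k = len(rule_words)
    return jaccard(top_k_words(importance, k), rule_words)


# Made-up scores for the sentence "the food was not good" (illustration only).
importance = {"the": 0.01, "food": 0.05, "was": 0.02, "not": -0.60, "good": 0.45}
rule_words = {"not", "good"}
score = compare_explanations(importance, rule_words)
print(score)
```

A score near 1 means both explanations single out the same words, and a score near 0 means they point to different parts of the input. Matching k to the size of the rule is only one possible design choice; a fixed k or a score threshold would give a different, equally defensible comparison.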