On Baselines for Local Feature Attributions
This work highlights a critical sensitivity of interpretability methods that affects machine learning practitioners working with tabular data and can lead to misinterpretations of model behavior.
This paper investigates the impact of baselines on the quality of local feature attributions for black-box models, with a focus on tabular datasets. It empirically demonstrates that the choice of baseline significantly alters the discriminative power of feature attributions, complementing prior work on image data.
High-performing predictive models, such as neural networks, usually operate as black boxes, which raises serious concerns about their interpretability. Local feature attribution methods help to explain black-box models and are therefore a powerful tool for assessing the reliability and fairness of predictions. To this end, most attribution models compare the importance of input features with a reference value, often called the baseline. Recent studies show that the baseline can heavily impact the quality of feature attributions. Yet, simplistic baselines, such as the zero vector, are still frequently used in practice. In this paper, we show empirically that baselines can significantly alter the discriminative power of feature attributions. We conduct our analysis on tabular datasets, thus complementing recent works on image data. In addition, we propose a new taxonomy of baseline methods. Our experimental study illustrates the sensitivity of popular attribution models to the baseline, thus laying the foundation for a more in-depth discussion on sensible baseline methods for tabular data.
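To make the role of the baseline concrete, the following is a minimal sketch, not the paper's experimental setup. It uses Integrated Gradients purely as one example of a baseline-dependent attribution method, and it assumes a hypothetical logistic "black box" with made-up weights and synthetic tabular data (`model`, `integrated_gradients`, `X_train`, `W`, `B` are all illustrative names). It compares a zero-vector baseline with a feature-mean baseline for the same input.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical toy "black box": a logistic model with fixed, made-up weights.
W = np.array([2.0, -1.0, 0.5])
B = -0.25


def model(x):
    """Predicted probability for a single feature vector x."""
    return sigmoid(x @ W + B)


def model_grad(x):
    """Analytic gradient of model(x) with respect to the input x."""
    s = model(x)
    return s * (1.0 - s) * W


def integrated_gradients(x, baseline, steps=200):
    """Approximate Integrated Gradients along the straight path from
    baseline to x, using a midpoint Riemann sum over `steps` points."""
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.array([model_grad(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)


# Hypothetical synthetic training data, used only to build a data-driven baseline.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[1.0, 2.0, -1.0], scale=1.0, size=(500, 3))

x = np.array([1.5, 0.5, 2.0])
baselines = {
    "zero vector": np.zeros(3),
    "feature means": X_train.mean(axis=0),
}

results = {}
for name, b in baselines.items():
    attr = integrated_gradients(x, b)
    results[name] = attr
    # Completeness: the attributions sum to f(x) - f(baseline),
    # up to the discretisation error of the Riemann sum.
    print(f"{name:14s} attributions={np.round(attr, 3)} "
          f"sum={attr.sum():.4f} f(x)-f(b)={model(x) - model(b):.4f}")
```

In this toy setting, the attribution of the second feature changes sign between the two baselines: its value (0.5) lies above the zero baseline but below the training mean (about 2.0), so the same model and input yield a qualitatively different explanation depending only on the reference point.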