LG AIJun 29, 2024

Explainability of Machine Learning Models under Missing Data

Tuan L. Vo, Thu Nguyen, Luis M. Lopez-Ramos, Hugo L. Hammer, Michael A. Riegler, Pal Halvorsen

arXiv:2407.00411v311.510 citationsHas Code

Originality Incremental advance

AI Analysis

It addresses the problem of ensuring reliable model explanations for practitioners when dealing with missing data, though it is incremental in evaluating existing imputation methods.

This paper investigates how missing data affects the explainability of machine learning models, finding that different imputation methods can introduce biases in SHAP values, and that lower prediction error does not guarantee accurate explanations.

Missing data is a prevalent issue that can significantly impair model performance and explainability. This paper briefly summarizes the development of the field of missing data with respect to Explainable Artificial Intelligence and experimentally investigates the effects of various imputation methods on SHAP (SHapley Additive exPlanations), a popular technique for explaining the output of complex machine learning models. Next, we compare different imputation strategies and assess their impact on feature importance and interaction as determined by Shapley values. Moreover, we also theoretically analyze the effects of missing values on Shapley values. Importantly, our findings reveal that the choice of imputation method can introduce biases that could lead to changes in the Shapley values, thereby affecting the explainability of the model. Moreover, we also show that a lower test prediction MSE (Mean Square Error) does not necessarily imply a lower MSE in Shapley values and vice versa. Also, while XGBoost (eXtreme Gradient Boosting) is a method that could handle missing data directly, using XGBoost directly on missing data can seriously affect explainability compared to imputing the data before training XGBoost. This study provides a comprehensive evaluation of imputation methods in the context of model explanations, offering practical guidance for selecting appropriate techniques based on dataset characteristics and analysis objectives. The results underscore the importance of considering imputation effects to ensure robust and reliable insights from machine learning models.

View on arXiv PDF Code

Similar