LG SE Jun 7, 2024

Automated Trustworthiness Testing for Machine Learning Classifiers

Steven Cho, Seaton Cousins-Baxter, Stefano Ruberto, Valerio Terragni

arXiv:2406.05251v12.6h-index: 18

Originality: Incremental advance

AI Analysis

This work addresses the need for automated trustworthiness testing in critical domains like finance and healthcare, but it is incremental: it builds on existing explanation techniques and does not achieve strong performance.

The paper tackles the problem of automatically assessing the trustworthiness of machine learning classifiers. It proposes TOWER, a technique that uses word embeddings to evaluate the explanations produced for a model's predictions. TOWER detected decreasing trustworthiness as noise increased, but it was not effective on a human-labeled dataset.

Machine Learning (ML) has become an integral part of our society, commonly used in critical domains such as finance, healthcare, and transportation. Therefore, it is crucial to evaluate not only whether ML models make correct predictions but also whether they do so for the correct reasons, ensuring our trust that they will perform well on unseen data. This concept is known as trustworthiness in ML. Recently, explainable techniques (e.g., LIME, SHAP) have been developed to interpret the decision-making processes of ML models, providing explanations for their predictions (e.g., the words in the input that influenced the prediction the most). Assessing the plausibility of these explanations can enhance our confidence in the models' trustworthiness. However, current approaches typically rely on human judgment to determine the plausibility of these explanations. This paper proposes TOWER, the first technique to automatically create trustworthiness oracles that determine whether text classifier predictions are trustworthy. It leverages word embeddings to automatically evaluate the trustworthiness of text classifiers, in a model-agnostic way, based on the outputs of explanatory techniques. Our hypothesis is that a prediction is trustworthy if the words in its explanation are semantically related to the predicted class. We perform unsupervised learning with untrustworthy models obtained from noisy data to find the optimal configuration of TOWER. We then evaluate TOWER on a human-labeled trustworthiness dataset that we created. The results show that TOWER can detect a decrease in trustworthiness as noise increases, but is not effective when evaluated against the human-labeled dataset. Our initial experiments suggest that our hypothesis is valid and promising, but further research is needed to better understand the relationship between explanations and trustworthiness issues.
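The hypothesis in the abstract (a prediction is trustworthy if its explanation words are semantically related to the predicted class) can be illustrated with a minimal sketch. This is not TOWER's actual algorithm, which the abstract does not specify. The toy three-dimensional embeddings, the mean-cosine aggregation, the 0.5 threshold, and the function names `relatedness` and `is_trustworthy` are all assumptions made for illustration. In practice the explanation words would come from a technique such as LIME or SHAP, and the vectors from a pretrained word-embedding model.

```python
import math

# Toy embeddings invented for this example; a real setup would load
# pretrained word vectors instead.
TOY_EMBEDDINGS = {
    "sports": [0.9, 0.1, 0.0],
    "goal": [0.8, 0.2, 0.1],
    "match": [0.7, 0.3, 0.0],
    "the": [0.1, 0.1, 0.9],
    "however": [0.0, 0.2, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def relatedness(explanation_words, class_name, embeddings):
    """Mean cosine similarity between explanation words and the class name.

    Words without an embedding are skipped (an assumption of this sketch).
    """
    class_vec = embeddings[class_name]
    sims = [
        cosine(embeddings[word], class_vec)
        for word in explanation_words
        if word in embeddings
    ]
    return sum(sims) / len(sims) if sims else 0.0


def is_trustworthy(explanation_words, class_name, embeddings, threshold=0.5):
    """Hypothetical oracle: trust the prediction if relatedness meets the threshold."""
    return relatedness(explanation_words, class_name, embeddings) >= threshold


# Explanation words related to the predicted class "sports"
related = is_trustworthy(["goal", "match"], "sports", TOY_EMBEDDINGS)
# Explanation dominated by words unrelated to the class
unrelated = is_trustworthy(["the", "however"], "sports", TOY_EMBEDDINGS)
```

With these toy vectors, `related` is true and `unrelated` is false. A fixed threshold is the simplest possible decision rule; the abstract instead describes tuning the configuration on models trained with noisy data, which this sketch does not attempt.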

