The Definitions of Interpretability and Learning of Interpretable Models
This work addresses the need for interpretability in machine learning applications, which is crucial for adoption in sensitive domains. However, the contribution appears incremental, as it builds on existing interpretability concepts.
The paper tackles the problem of defining and training interpretable machine learning models by proposing a mathematical definition for human-interpretable models and a practical framework for training them through user interactions. Experiments on image datasets show that the model provides a human-understandable decision-making process and is more robust against adversarial attacks.
As machine learning algorithms are adopted in an ever-increasing number of applications, interpretability has emerged as a crucial desideratum. In this paper, we propose a mathematical definition of a human-interpretable model. In particular, we define interpretability between two information processing systems. If a prediction model is interpretable by a human recognition system under this definition, we call it a completely human-interpretable model. We further design a practical framework for training a completely human-interpretable model through user interactions. Experiments on image datasets show two advantages of the proposed model: 1) it provides an entire decision-making process that is human-understandable; 2) it is more robust against adversarial attacks.