LG, ML | Feb 18, 2019

Regularizing Black-box Models for Improved Interpretability

arXiv:1902.06787v6 | 84 citations
Originality: Incremental advance
AI Analysis

This work addresses unreliable post-hoc explanations, a practical problem for machine learning practitioners. It offers a model-agnostic solution that improves interpretability without sacrificing accuracy. The advance is incremental, since it builds on existing interpretability approaches.

The paper tackles the trade-off between model accuracy and interpretability by introducing ExpO, a method that regularizes black-box models for explanation quality during training. Post-hoc explanations of the regularized models show better fidelity and stability, and a user study confirms that these gains make the explanations more useful.

Most of the work on interpretable machine learning has focused on designing either inherently interpretable models, which typically trade off accuracy for interpretability, or post-hoc explanation systems, whose explanation quality can be unpredictable. Our method, ExpO, is a hybridization of these approaches that regularizes a model for explanation quality at training time. Importantly, these regularizers are differentiable, model agnostic, and require no domain knowledge to define. We demonstrate that post-hoc explanations for ExpO-regularized models have better explanation quality, as measured by the common fidelity and stability metrics. We verify with a user study on a realistic task that improving these metrics leads to significantly more useful explanations.
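To make the two metrics named in the abstract concrete, here is a minimal NumPy sketch of how fidelity and stability might be measured for a local linear explanation. It is an illustration under stated assumptions, not the authors' implementation: the Gaussian neighborhood, the least-squares explainer, and every function name and default value below are hypothetical choices made for this example.

```python
import numpy as np


def sample_neighborhood(x, sigma, n_samples, rng):
    """Draw points around x from an isotropic Gaussian (an assumed neighborhood)."""
    return x + sigma * rng.standard_normal((n_samples, x.shape[0]))


def fit_local_linear(f, X):
    """Least-squares linear fit (with intercept) to the model's outputs on X."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(A, f(X), rcond=None)
    return coef  # last entry is the intercept


def neighborhood_fidelity(f, x, sigma=0.1, n_samples=200, seed=0):
    """Mean squared gap between f and its local linear fit near x (lower is better)."""
    rng = np.random.default_rng(seed)
    X = sample_neighborhood(x, sigma, n_samples, rng)
    coef = fit_local_linear(f, X)
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    return float(np.mean((A @ coef - f(X)) ** 2))


def stability(f, x, sigma=0.1, n_neighbors=20, n_samples=200, seed=0):
    """Mean squared change in explanation weights at nearby points (lower is better)."""
    rng = np.random.default_rng(seed)
    base = fit_local_linear(f, sample_neighborhood(x, sigma, n_samples, rng))[:-1]
    diffs = []
    for x_near in sample_neighborhood(x, sigma, n_neighbors, rng):
        w = fit_local_linear(f, sample_neighborhood(x_near, sigma, n_samples, rng))[:-1]
        diffs.append(np.sum((w - base) ** 2))
    return float(np.mean(diffs))


def linear_model(X):
    """Toy model that a linear explanation can match exactly."""
    return X @ np.array([2.0, -1.0]) + 0.5


def wiggly_model(X):
    """Toy model that is strongly nonlinear at the neighborhood's scale."""
    return np.sin(10.0 * X[:, 0])


x0 = np.zeros(2)
print("fidelity, linear model:", neighborhood_fidelity(linear_model, x0))
print("fidelity, wiggly model:", neighborhood_fidelity(wiggly_model, x0))
print("stability, linear model:", stability(linear_model, x0))
print("stability, wiggly model:", stability(wiggly_model, x0))
```

The linear toy model should score near zero on both metrics, while the wiggly one should score noticeably worse. This version only measures the quantities. Using them as training-time regularizers, as the abstract describes, would require expressing the same computation in an automatic-differentiation framework so the penalty can be added to the task loss.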

Code Implementations: 1 repo
