LG, HC | Jun 24, 2022

On the Importance of Application-Grounded Experimental Design for Evaluating Explainable ML Methods

CMU
arXiv:2206.13503v4 | 3 citations | h-index: 51
Originality: Synthesis-oriented
AI Analysis

This work addresses unreliable evaluations of explainable ML methods, a problem for both researchers and practitioners, and argues for application-grounded experimental designs. Its contribution is incremental in that it builds on a previous study of e-commerce fraud detection.

The study evaluated three popular explainable ML methods in a real-world e-commerce fraud detection setting and found no evidence of their incremental utility for the task. The results show how seemingly trivial experimental design choices can lead to misleading conclusions.

Most existing evaluations of explainable machine learning (ML) methods rely on simplifying assumptions or proxies that do not reflect real-world use cases; the handful of more robust evaluations in real-world settings have shortcomings in their design, resulting in limited conclusions about the methods' real-world utility. In this work, we seek to bridge this gap by conducting a study that evaluates three popular explainable ML methods in a setting consistent with the intended deployment context. We build on a previous study on e-commerce fraud detection and make crucial modifications to its setup, relaxing the simplifying assumptions made in the original work that departed from the deployment context. In doing so, we draw drastically different conclusions from the earlier work and find no evidence for the incremental utility of the tested methods in the task. Our results highlight how seemingly trivial experimental design choices can yield misleading conclusions, with lessons about the necessity of not only evaluating explainable ML methods using tasks, data, users, and metrics grounded in the intended deployment contexts but also developing methods tailored to specific applications. In addition, we believe the design of this experiment can serve as a template for future study designs evaluating explainable ML methods in other real-world contexts.

