ML CY LG · Jun 2, 2023

Auditing for Human Expertise

Rohan Alur, Loren Laine, Darrick K. Li, Manish Raghavan, Devavrat Shah, Dennis Shung

MIT

arXiv:2306.01646v3 · 18.417 citations · h-index: 63 · Has Code

Originality: Incremental advance

AI Analysis

This paper addresses the challenge of justifying automation in critical domains by providing a method to audit human expertise, though its contribution is an incremental refinement of existing statistical approaches.

The paper asks whether human experts add value beyond algorithmic predictors in high-stakes tasks such as patient diagnosis. It develops a statistical framework that tests whether expert predictions are independent of outcomes after conditioning on the available features. Applied to emergency department data, the test indicates that physicians incorporate information not captured by a screening tool, even though the tool is arguably more accurate.

High-stakes prediction tasks (e.g., patient diagnosis) are often handled by trained human experts. A common source of concern about automation in these settings is that experts may exercise intuition that is difficult to model and/or have access to information (e.g., conversations with a patient) that is simply unavailable to a would-be algorithm. This raises the natural question of whether human experts add value that could not be captured by an algorithmic predictor. We develop a statistical framework under which we can pose this question as a natural hypothesis test. Indeed, as our framework highlights, detecting human expertise is more subtle than simply comparing the accuracy of expert predictions to those made by a particular learning algorithm. Instead, we propose a simple procedure which tests whether expert predictions are statistically independent from the outcomes of interest after conditioning on the available inputs ('features'). A rejection of our test thus suggests that human experts may add value to any algorithm trained on the available data, and has direct implications for whether human-AI 'complementarity' is achievable in a given prediction task. We highlight the utility of our procedure using admissions data collected from the emergency department of a large academic hospital system, where we show that physicians' admit/discharge decisions for patients with acute gastrointestinal bleeding (AGIB) appear to be incorporating information that is not available to a standard algorithmic screening tool. This is despite the fact that the screening tool is arguably more accurate than physicians' discretionary decisions, highlighting that, even absent normative concerns about accountability or interpretability, accuracy is insufficient to justify algorithmic automation.
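
The conditional independence test described in the abstract can be illustrated with a small resampling sketch. This is not the authors' implementation. It assumes one plausible scheme: pair observations whose features are nearly identical, randomly swap expert predictions within each pair, and check how often the swapped predictions do at least as well as the originals. The function name `paired_swap_test`, the squared-error loss, and the greedy nearest-neighbour pairing are all hypothetical choices made for illustration.

```python
import random


def paired_swap_test(features, expert_preds, outcomes, n_pairs,
                     n_resamples=1000, seed=0):
    """Approximate p-value for H0: expert predictions are independent of
    outcomes given features. Illustrative sketch only (hypothetical helper)."""
    rng = random.Random(seed)
    n = len(features)

    # Squared Euclidean distance between two feature vectors.
    def dist(i, j):
        return sum((a - b) ** 2 for a, b in zip(features[i], features[j]))

    # Greedily pair the closest observations, using each index at most once.
    candidates = sorted((dist(i, j), i, j)
                        for i in range(n) for j in range(i + 1, n))
    used, pairs = set(), []
    for _, i, j in candidates:
        if i not in used and j not in used:
            pairs.append((i, j))
            used.update((i, j))
            if len(pairs) == n_pairs:
                break
    paired_idx = sorted(used)

    # Squared-error loss over the paired observations (an assumed loss).
    def loss(preds):
        return sum((preds[k] - outcomes[k]) ** 2 for k in paired_idx)

    observed = loss(expert_preds)

    # Under H0, swapping predictions within a pair of near-identical
    # observations should not systematically change the loss.
    at_least_as_good = 0
    for _ in range(n_resamples):
        shuffled = list(expert_preds)
        for i, j in pairs:
            if rng.random() < 0.5:
                shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
        if loss(shuffled) <= observed:
            at_least_as_good += 1

    # Standard add-one permutation p-value.
    return (1 + at_least_as_good) / (1 + n_resamples)
```

A small p-value suggests that the expert's predictions track outcomes in a way the features alone do not explain. Note that the sketch never compares the expert to a fitted model, which mirrors the abstract's point that detecting expertise is a different question from comparing accuracy.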
