CV · AI · LG · Jun 26, 2023

Beyond AUROC & co. for evaluating out-of-distribution detection performance

arXiv:2306.14658v1 · 11 citations · h-index: 70 · Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses a critical evaluation gap in OOD detection for safer AI systems, though it is incremental as it focuses on improving metrics rather than developing new detection methods.

The paper tackles the problem of evaluating out-of-distribution (OOD) detection methods by highlighting limitations in current metrics like AUROC and proposing a new metric called Area Under the Threshold Curve (AUTC) to better align with practical needs for safe AI.

While there has been growing research interest in developing out-of-distribution (OOD) detection methods, there has been comparatively little discussion of how these methods should be evaluated. Given their relevance for safe(r) AI, it is important to examine whether the basis for comparing OOD detection methods is consistent with practical needs. In this work, we take a closer look at the go-to metrics for evaluating OOD detection, and question the approach of exclusively reducing OOD detection to a binary classification task with little consideration for the detection threshold. We illustrate the limitations of current metrics (AUROC & its friends) and propose a new metric, Area Under the Threshold Curve (AUTC), which explicitly penalizes poor separation between ID and OOD samples. Scripts and data are available at https://github.com/glhr/beyond-auroc
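To make the abstract's point concrete, here is a minimal sketch (not the authors' code) of why a ranking metric such as AUROC can hide poor separation. It assumes that OOD samples should receive higher scores than ID samples, that scores are min-max normalized to [0, 1], and that an AUTC-style score can be taken as the mean of the areas under the false-positive-rate and false-negative-rate curves over all thresholds (lower is better). The function names `auroc` and `threshold_curve_area` and the toy scores are hypothetical; the linked repository holds the authors' actual definition and implementation.

```python
import numpy as np


def auroc(id_scores, ood_scores):
    """Chance that a random OOD sample outscores a random ID sample (ties count 0.5)."""
    id_scores = np.asarray(id_scores, dtype=float)
    ood_scores = np.asarray(ood_scores, dtype=float)
    diff = ood_scores[:, None] - id_scores[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


def threshold_curve_area(id_scores, ood_scores, n_thresholds=1001):
    """Assumed AUTC-style score: mean area under FPR(t) and FNR(t), t in [0, 1]."""
    id_scores = np.asarray(id_scores, dtype=float)
    ood_scores = np.asarray(ood_scores, dtype=float)

    # Min-max normalize all scores jointly so thresholds live in [0, 1].
    lo = min(id_scores.min(), ood_scores.min())
    hi = max(id_scores.max(), ood_scores.max())
    if hi == lo:
        raise ValueError("all scores are identical; cannot normalize")
    id_n = (id_scores - lo) / (hi - lo)
    ood_n = (ood_scores - lo) / (hi - lo)

    t = np.linspace(0.0, 1.0, n_thresholds)
    # A sample is flagged as OOD when its score is >= t.
    fpr = (id_n[None, :] >= t[:, None]).mean(axis=1)   # ID wrongly flagged
    fnr = (ood_n[None, :] < t[:, None]).mean(axis=1)   # OOD missed

    def area(y):
        # Trapezoidal rule over the threshold grid.
        return float(np.sum((y[1:] + y[:-1]) * np.diff(t)) / 2.0)

    return 0.5 * (area(fpr) + area(fnr))


# Two toy detectors that both rank every OOD sample above every ID sample.
id_a, ood_a = [0.0, 0.10], [0.90, 1.0]   # wide gap between ID and OOD
id_b, ood_b = [0.0, 0.49], [0.51, 1.0]   # barely separated

print(auroc(id_a, ood_a), auroc(id_b, ood_b))
print(threshold_curve_area(id_a, ood_a), threshold_curve_area(id_b, ood_b))
```

Both toy detectors reach an AUROC of 1.0 because AUROC depends only on the ranking of scores. The threshold-based area is about 0.05 for the well-separated detector and about 0.245 for the barely separated one, so under these assumptions it is the only one of the two metrics that tells them apart.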

Code Implementations: 2 repos