On the Usefulness of Deep Ensemble Diversity for Out-of-Distribution Detection
This work addresses an important problem for safety-critical applications by refining OOD detection methods, though the contribution is incremental because it builds on existing ensemble techniques.
The paper tackles out-of-distribution (OOD) detection in deep learning by challenging the intuition that ensemble diversity measures such as Mutual Information (MI) improve performance. It shows that on ImageNet-scale OOD detection, MI can give 30-40% worse %FPR@95 than single-model entropy on some OOD datasets, and it proposes averaging task-specific detection scores such as Energy over the ensemble for better results.
The ability to detect Out-of-Distribution (OOD) data is important in safety-critical applications of deep learning. The aim is to separate In-Distribution (ID) data drawn from the training distribution from OOD data using a measure of uncertainty extracted from a deep neural network. Deep Ensembles are a well-established method of improving the quality of uncertainty estimates produced by deep neural networks, and have been shown to have superior OOD detection performance compared to single models. An existing intuition in the literature is that the diversity of Deep Ensemble predictions indicates distributional shift, and so measures of diversity such as Mutual Information (MI) should be used for OOD detection. We show experimentally that this intuition is not valid on ImageNet-scale OOD detection -- using MI leads to 30-40% worse %FPR@95 compared to single-model entropy on some OOD datasets. We suggest an alternative explanation for Deep Ensembles' better OOD detection performance -- OOD detection is binary classification and we are ensembling diverse classifiers. As such we show that practically, even better OOD detection performance can be achieved for Deep Ensembles by averaging task-specific detection scores such as Energy over the ensemble.
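The scores contrasted in the abstract can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code. It assumes the ensemble's logits are stacked in an array of shape (members, samples, classes), that every score is oriented so that larger means "more likely OOD", and that Energy is the negative log-sum-exp of the logits at temperature 1. The function names `ensemble_scores` and `fpr_at_95_tpr` are hypothetical, and the FPR@95 helper assumes ID data is the positive class, which may differ from the paper's convention.

```python
import numpy as np


def log_softmax(logits):
    # Numerically stable log-softmax over the last (class) axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def entropy(probs):
    # Shannon entropy in nats; clipping avoids log(0).
    return -(probs * np.log(np.clip(probs, 1e-12, 1.0))).sum(axis=-1)


def ensemble_scores(logits):
    """Per-sample OOD scores from logits of shape (members, samples, classes).

    Assumption: larger score means "more likely OOD" for every entry.
    """
    probs = np.exp(log_softmax(logits))        # (M, N, K)
    total = entropy(probs.mean(axis=0))        # entropy of the averaged prediction
    expected = entropy(probs).mean(axis=0)     # average of per-member entropies
    # Per-member Energy: -logsumexp(logits), temperature 1 (an assumption).
    peak = logits.max(axis=-1, keepdims=True)
    lse = (peak + np.log(np.exp(logits - peak).sum(axis=-1, keepdims=True))).squeeze(-1)
    return {
        "single_entropy": entropy(probs[0]),       # first member only
        "ensemble_entropy": total,
        "mutual_information": total - expected,    # diversity measure
        "mean_energy": (-lse).mean(axis=0),        # score averaged over members
    }


def fpr_at_95_tpr(id_scores, ood_scores):
    # Assumption: ID is the positive class. The threshold keeps 95% of ID
    # samples; FPR is the fraction of OOD samples that also pass it.
    threshold = np.percentile(id_scores, 95)
    return float((np.asarray(ood_scores) <= threshold).mean())
```

A comparison in the spirit of the abstract would call `ensemble_scores` once on ID logits and once on OOD logits, then pass each matching pair of score arrays to `fpr_at_95_tpr`. When all members agree, `mutual_information` is zero regardless of how uncertain each member is, which is one way to see why a pure diversity measure discards information that entropy and Energy retain.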