Unmasking the Chameleons: A Benchmark for Out-of-Distribution Detection in Medical Tabular Data
This work addresses the challenge of reliably deploying ML models in healthcare by detecting out-of-distribution (OOD) samples. Its contribution is incremental: it benchmarks existing methods rather than introducing new ones.
The authors tackle OOD detection in medical tabular data by proposing a benchmark built on the eICU and MIMIC-IV datasets. They find that the problem appears to be solved for far-OODs but remains open for near-OODs, that post-hoc methods perform poorly on their own but improve substantially when combined with distance-based mechanisms, and that the transformer architecture is far less overconfident than MLP and ResNet.
Despite their success, machine learning (ML) models do not generalize effectively to data not originating from the training distribution. To reliably employ ML models in real-world healthcare systems and avoid inaccurate predictions on out-of-distribution (OOD) data, it is crucial to detect OOD samples. Numerous OOD detection approaches have been suggested in other fields, especially in computer vision, but it remains unclear whether the challenge is resolved when dealing with medical tabular data. To address this need, we propose an extensive, reproducible benchmark to compare different methods across a suite of tests including both near and far OODs. Our benchmark leverages the latest versions of eICU and MIMIC-IV, two public datasets encompassing tens of thousands of ICU patients across several hospitals. We consider a wide array of density-based methods and state-of-the-art (SOTA) post-hoc detectors across diverse predictive architectures, including MLP, ResNet, and Transformer. Our findings show that i) the problem appears to be solved for far-OODs, but remains open for near-OODs; ii) post-hoc methods alone perform poorly, but improve substantially when coupled with distance-based mechanisms; iii) the transformer architecture is far less overconfident than MLP and ResNet.
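To make the distinction between detector families concrete, the following is a minimal sketch and not the benchmark's implementation. It assumes maximum softmax probability as a representative post-hoc score and the distance to the k-th nearest training sample as a representative distance-based score; the abstract does not say which detectors are used or how they are coupled. All function names, the synthetic Gaussian data, and the shift sizes standing in for near and far OODs are hypothetical.

```python
import numpy as np

def msp_score(logits):
    """Post-hoc score from classifier logits: 1 - max softmax probability.
    Higher values mean the model is less confident (more OOD-like)."""
    z = logits - logits.max(axis=1, keepdims=True)  # stabilize the exponent
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return 1.0 - p.max(axis=1)

def knn_distance_score(train_feats, test_feats, k=5):
    """Distance-based score: Euclidean distance from each test point
    to its k-th nearest training point. Higher means more OOD-like."""
    d = np.linalg.norm(test_feats[:, None, :] - train_feats[None, :, :], axis=2)
    return np.sort(d, axis=1)[:, k - 1]

def auroc(scores_in, scores_ood):
    """Probability that a random OOD sample receives a higher score
    than a random in-distribution sample (ties count as one half)."""
    diff = scores_ood[:, None] - scores_in[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

# Synthetic stand-ins (assumption): a small mean shift plays the role of a
# near-OOD group, a large mean shift plays the role of a far-OOD group.
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(200, 4))
id_test = rng.normal(0.0, 1.0, size=(100, 4))
near_ood = rng.normal(0.5, 1.0, size=(100, 4))
far_ood = rng.normal(8.0, 1.0, size=(100, 4))

s_id = knn_distance_score(train, id_test)
auroc_near = auroc(s_id, knn_distance_score(train, near_ood))
auroc_far = auroc(s_id, knn_distance_score(train, far_ood))
print(f"near-OOD AUROC: {auroc_near:.3f}, far-OOD AUROC: {auroc_far:.3f}")
```

On this toy data the far-shifted group is much easier to separate from the in-distribution test set than the near-shifted one, which mirrors the qualitative near/far gap the abstract reports. The numbers themselves say nothing about eICU or MIMIC-IV.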