LG | QM | AP | ML | Nov 6, 2023

Validity problems in clinical machine learning by indirect data labeling using consensus definitions

arXiv:2311.03037v1 | 2 citations | h-index: 34
Originality: Synthesis-oriented
AI Analysis

The paper addresses a critical flaw in machine learning models for disease diagnosis. The contribution is incremental but methodologically important.

The paper identifies a validity problem in clinical machine learning: when target labels are derived indirectly from measurements that also appear among the model inputs, a trained model can simply reconstruct the label definition. Such a model performs perfectly on similarly constructed test data but fails catastrophically in real-world scenarios where those underlying measurements are missing or incomplete. The paper demonstrates the issue on early sepsis prediction.

We demonstrate a validity problem of machine learning in the vital application area of disease diagnosis in medicine. It arises when target labels in training data are determined by an indirect measurement, and the fundamental measurements needed to determine this indirect measurement are included in the input data representation. Machine learning models trained on this data will learn nothing else but to exactly reconstruct the known target definition. Such models show perfect performance on similarly constructed test data but will fail catastrophically on real-world examples where the defining fundamental measurements are not or only incompletely available. We present a general procedure allowing identification of problematic datasets and black-box machine learning models trained on them, and exemplify our detection procedure on the task of early prediction of sepsis.
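The mechanism described in the abstract can be illustrated with a small synthetic sketch. This is not the paper's detection procedure or its sepsis data. It is a minimal ablation-style check under stated assumptions: the feature names `part_a`, `part_b`, and `other`, the threshold rule standing in for a consensus definition, and the nearest-centroid classifier are all hypothetical choices made for illustration.

```python
import random

def make_dataset(n, rng):
    """Synthetic rows: two 'defining' measurements, one unrelated covariate, a derived label."""
    rows = []
    for _ in range(n):
        part_a = rng.random()   # hypothetical fundamental measurement
        part_b = rng.random()   # hypothetical fundamental measurement
        other = rng.random()    # covariate that carries no information about the label
        # Stand-in for a consensus definition: the label is computed from part_a and part_b.
        label = 1 if part_a + part_b > 1.0 else 0
        rows.append(([part_a, part_b, other], label))
    return rows

def fit_centroids(rows):
    """Nearest-centroid classifier: store the mean feature vector of each class."""
    centroids = {}
    for cls in (0, 1):
        members = [x for x, y in rows if y == cls]
        centroids[cls] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def predict(centroids, x):
    """Return the class whose centroid is closest to x (squared Euclidean distance)."""
    def sq_dist(c):
        return sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return min(centroids, key=lambda cls: sq_dist(centroids[cls]))

def accuracy(centroids, rows):
    return sum(predict(centroids, x) == y for x, y in rows) / len(rows)

def mask_defining(rows, fill, defining=(0, 1)):
    """Simulate deployment: overwrite the defining measurements with imputed values."""
    masked = []
    for x, y in rows:
        x2 = list(x)
        for i in defining:
            x2[i] = fill[i]
        masked.append((x2, y))
    return masked

rng = random.Random(0)
train = make_dataset(1000, rng)
test = make_dataset(1000, rng)
model = fit_centroids(train)

# Per-feature training means, used as the imputation value for masked features.
train_means = [sum(col) / len(train) for col in zip(*(x for x, _ in train))]

acc_full = accuracy(model, test)
acc_masked = accuracy(model, mask_defining(test, train_means))
print(f"defining measurements present: {acc_full:.2f}")
print(f"defining measurements imputed: {acc_masked:.2f}")
```

In this toy setting the classifier recovers the threshold rule almost exactly when `part_a` and `part_b` are present, and drops to roughly chance level once they are replaced by imputed values. A large gap of this kind would be one warning sign that a model has learned the label definition and little else.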

Code Implementations: 1 repo
