Preventing dataset shift from breaking machine-learning biomarkers
This work addresses dataset shift for biomedical researchers who use machine-learning biomarkers; its contribution is incremental, since it offers an overview rather than a novel solution.
The paper addresses the problem of dataset shift undermining machine-learning biomarkers in biomedical research, providing an overview of detection and correction strategies to improve reliability.
Machine learning brings the hope of finding new biomarkers extracted from cohorts with rich biomedical measurements. A good biomarker is one that gives reliable detection of the corresponding condition. However, biomarkers are often extracted from a cohort that differs from the target population. Such a mismatch, known as a dataset shift, can undermine the application of the biomarker to new individuals. Dataset shifts are frequent in biomedical research, e.g., because of recruitment biases. When a dataset shift occurs, standard machine-learning techniques do not suffice to extract and validate biomarkers. This article provides an overview of when and how dataset shifts break machine-learning-extracted biomarkers, as well as detection and correction strategies.
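To make the recruitment-bias example concrete, the following is a minimal sketch of one widely used correction idea, importance weighting by stratum, applied to the validation of a biomarker. It is an illustration under stated assumptions, not necessarily the procedure the article recommends: the cohort, the strata ("young", "old"), the target proportions, and the function names stratum_weights and reweighted_accuracy are all hypothetical.

```python
from collections import Counter


def stratum_weights(source_strata, target_proportions):
    """Importance weight per stratum: p_target(s) / p_source(s).

    Assumes every stratum observed in the cohort has a known, nonzero
    proportion in the target population (hypothetical input).
    """
    counts = Counter(source_strata)
    n = len(source_strata)
    return {s: target_proportions[s] / (counts[s] / n) for s in counts}


def reweighted_accuracy(correct, strata, weights):
    """Weighted fraction of correct predictions, normalised by total weight."""
    weighted_hits = sum(weights[s] * c for c, s in zip(correct, strata))
    total_weight = sum(weights[s] for s in strata)
    return weighted_hits / total_weight


# Hypothetical cohort: recruitment over-represents young participants.
strata = ["young"] * 8 + ["old"] * 2
# 1 = biomarker prediction correct, 0 = incorrect (made-up values):
# 7 of 8 correct among the young, 1 of 2 correct among the old.
correct = [1, 1, 1, 1, 1, 1, 1, 0, 1, 0]
# Assumed composition of the target population.
target = {"young": 0.5, "old": 0.5}

weights = stratum_weights(strata, target)
naive = sum(correct) / len(correct)
adjusted = reweighted_accuracy(correct, strata, weights)

print(f"naive accuracy on cohort:      {naive:.2f}")
print(f"reweighted for target strata:  {adjusted:.2f}")
```

In this toy cohort the naive accuracy is 0.80, while the reweighted estimate is about 0.69, because the under-recruited stratum, where the biomarker performs worse, counts for more in the target population. Reweighting of this kind only accounts for shifts in the variables used to define the strata; it does not help when the relationship between measurements and condition itself changes.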