LG AI | Nov 7, 2023

On Leakage in Machine Learning Pipelines

Leonard Sasse, Eliana Nicolaisen-Sobesky, Juergen Dukart, Simon B. Eickhoff, Michael Götz, Sami Hamdan, Vera Komeyer, Abhijit Kulkarni, Juha Lahnakoski, Bradley C. Love, Federico Raimondo, Kaustubh R. Patil

arXiv:2311.04179v26.662 citations | h-index: 48

Originality: Synthesis-oriented

AI Analysis

This work addresses a critical issue for ML practitioners and researchers, since leakage can have severe financial and societal implications. The contribution is incremental, however: it expands on existing understanding rather than introducing new methods.

The paper tackles the problem of data leakage in machine learning pipelines, which leads to overoptimistic performance estimates and poor generalization, by providing a comprehensive overview and discussion of various leakage types with concrete examples.

Machine learning (ML) provides powerful tools for predictive modeling. ML's popularity stems from the promise of sample-level prediction with applications across a variety of fields from physics and marketing to healthcare. However, if not properly implemented and evaluated, ML pipelines may contain leakage typically resulting in overoptimistic performance estimates and failure to generalize to new data. This can have severe negative financial and societal implications. Our aim is to expand understanding associated with causes leading to leakage when designing, implementing, and evaluating ML pipelines. Illustrated by concrete examples, we provide a comprehensive overview and discussion of various types of leakage that may arise in ML pipelines.
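One common form of leakage can be sketched in a few lines. The example below is an illustration only and is not taken from the paper: it assumes a simple hold-out split and shows preprocessing leakage, where standardization statistics are estimated from all samples (test data included) instead of from the training data alone. The helper names (`fit_scaler`, `scale_with_leakage`, `scale_without_leakage`) and the toy data are hypothetical.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Estimate (mean, std) from the given values only."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # fall back to 1.0 if the variance is zero
    return mu, sigma


def transform(values, scaler):
    """Standardize values with previously estimated statistics."""
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]


def split(values, n_test):
    """Hold out the last n_test samples as the test set."""
    return values[:-n_test], values[-n_test:]


def scale_with_leakage(values, n_test):
    # Leaky: the scaler sees every sample, so test-set information
    # influences how the training data is transformed.
    scaler = fit_scaler(values)
    train, test = split(values, n_test)
    return transform(train, scaler), transform(test, scaler)


def scale_without_leakage(values, n_test):
    # Leak-free: split first, then estimate statistics on the training part only.
    train, test = split(values, n_test)
    scaler = fit_scaler(train)
    return transform(train, scaler), transform(test, scaler)


# Hypothetical toy data; the last value plays the role of the held-out sample.
data = [1.0, 2.0, 3.0, 4.0, 100.0]
leaky_train, _ = scale_with_leakage(data, n_test=1)
clean_train, _ = scale_without_leakage(data, n_test=1)
print("train features (leaky):    ", leaky_train)
print("train features (leak-free):", clean_train)
```

In the leaky variant, changing the held-out value changes the transformed training features, which means the model is effectively trained with knowledge of the test set. Splitting before fitting any data-dependent step removes that dependence. The same principle applies to feature selection, imputation, and hyperparameter tuning.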
