LG · AI · ML · Jun 22, 2020

Machine Learning Pipelines: Provenance, Reproducibility and FAIR Data Principles

arXiv:2006.12117v1 · 51 citations
Originality: Synthesis-oriented
AI Analysis

This paper addresses unreliable ML results, a problem for both researchers and practitioners, but its contribution is incremental: it builds on existing reproducibility work and the FAIR principles.

The paper tackles the reproducibility crisis in machine learning by investigating which factors beyond source code and datasets affect reproducibility, and it proposes applying FAIR data principles to ML workflows. Preliminary results show the role of ProvBook in capturing and comparing the provenance of ML experiments in Jupyter Notebooks.

Machine learning (ML) is an increasingly important scientific tool supporting decision making and knowledge generation in numerous fields. With this, it also becomes more and more important that the results of ML experiments are reproducible. Unfortunately, that often is not the case. Rather, ML, similar to many other disciplines, faces a reproducibility crisis. In this paper, we describe our goals and initial steps in supporting the end-to-end reproducibility of ML pipelines. We investigate which factors beyond the availability of source code and datasets influence reproducibility of ML experiments. We propose ways to apply FAIR data practices to ML workflows. We present our preliminary results on the role of our tool, ProvBook, in capturing and comparing provenance of ML experiments and their reproducibility using Jupyter Notebooks.
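To make the idea of capturing and comparing provenance concrete, here is a minimal sketch in plain Python. It is not ProvBook's API and does not reproduce the authors' method; the function names (`capture_provenance`, `run_experiment`, `compare_provenance`) and the recorded fields are illustrative assumptions. It records a few factors beyond code and data that can influence a run (random seed, hyperparameters, interpreter version, platform) and reports which fields differ between two runs.

```python
import hashlib
import json
import platform
import random
import time


def capture_provenance(seed, params, data_bytes):
    """Collect a small, hypothetical provenance record for one run."""
    return {
        "seed": seed,
        "params": dict(params),
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "started_at": time.time(),
    }


def run_experiment(seed, params, data_bytes):
    """Stand-in for a training run: returns (result, provenance record)."""
    record = capture_provenance(seed, params, data_bytes)
    rng = random.Random(seed)  # local RNG so runs do not share hidden state
    start = time.perf_counter()
    # Placeholder "model": a seeded sum, so the result depends only on the seed
    # and on the n_samples parameter.
    result = round(sum(rng.random() for _ in range(params["n_samples"])), 6)
    record["duration_s"] = time.perf_counter() - start
    record["result"] = result
    return result, record


def compare_provenance(a, b, ignore=("started_at", "duration_s")):
    """Return {field: (a_value, b_value)} for every field that differs."""
    keys = (set(a) | set(b)) - set(ignore)
    return {k: (a.get(k), b.get(k)) for k in sorted(keys) if a.get(k) != b.get(k)}


if __name__ == "__main__":
    data = b"x1,x2,y\n0.1,0.2,1\n"
    params = {"n_samples": 100}
    _, first = run_experiment(42, params, data)
    _, rerun = run_experiment(42, params, data)
    _, other = run_experiment(7, params, data)
    # Same seed, parameters and data: no differing fields are expected.
    print(json.dumps(compare_provenance(first, rerun), indent=2))
    # Different seed: the seed and the result fields are expected to differ.
    print(json.dumps(compare_provenance(first, other), indent=2))
```

Timing fields are ignored by default in the comparison because they vary between otherwise identical runs. A real pipeline would likely also record library versions and hardware details.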

