LG AI ML

Jun 22, 2020

Machine Learning Pipelines: Provenance, Reproducibility and FAIR Data Principles

Sheeba Samuel, Frank Löffler, Birgitta König-Ries

arXiv:2006.12117v1

11.151 citations

Originality: Synthesis-oriented

AI Analysis

The paper addresses the problem of unreliable ML results for researchers and practitioners, but its contribution is incremental because it builds on existing reproducibility work and the FAIR principles.

The paper tackles the reproducibility crisis in machine learning by investigating which factors beyond source code and datasets affect reproducibility, and it proposes applying FAIR data practices to ML workflows. Preliminary results show the role of ProvBook in capturing and comparing the provenance of ML experiments in Jupyter Notebooks.

Machine learning (ML) is an increasingly important scientific tool supporting decision making and knowledge generation in numerous fields. With this, it also becomes more and more important that the results of ML experiments are reproducible. Unfortunately, that often is not the case. Rather, ML, similar to many other disciplines, faces a reproducibility crisis. In this paper, we describe our goals and initial steps in supporting the end-to-end reproducibility of ML pipelines. We investigate which factors beyond the availability of source code and datasets influence reproducibility of ML experiments. We propose ways to apply FAIR data practices to ML workflows. We present our preliminary results on the role of our tool, ProvBook, in capturing and comparing provenance of ML experiments and their reproducibility using Jupyter Notebooks.
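To make the idea of capturing and comparing provenance concrete, here is a minimal sketch in Python. It is not ProvBook's API or the authors' method. The record schema and the functions `capture_provenance`, `run_experiment`, and `diff_records` are hypothetical names chosen for illustration, and the "experiment" is a toy seeded split rather than real training. The sketch assumes that the factors worth recording beyond code and data include the random seed, the hyperparameters, and the interpreter version.

```python
import hashlib
import json
import platform
import random
import sys


def capture_provenance(seed, params, data_bytes):
    """Collect a small provenance record for one run (hypothetical schema)."""
    return {
        "python_version": platform.python_version(),
        "platform": sys.platform,
        "seed": seed,
        "params": dict(params),
        # Hash of the input data, so two runs can be checked for identical inputs.
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
    }


def run_experiment(seed, params, data):
    """Stand-in for a training run: a seeded shuffle, a split, and a toy score."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * params["train_fraction"])
    train = shuffled[:cut]
    return sum(train) / len(train)


def diff_records(a, b):
    """Return the keys whose values differ between two provenance records."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


data = list(range(100))
data_bytes = json.dumps(data).encode("utf-8")
params = {"train_fraction": 0.8}

prov_a = capture_provenance(42, params, data_bytes)
prov_b = capture_provenance(42, params, data_bytes)
prov_c = capture_provenance(7, params, data_bytes)

result_a = run_experiment(42, params, data)
result_b = run_experiment(42, params, data)

print(diff_records(prov_a, prov_b))  # empty when all recorded factors match
print(diff_records(prov_a, prov_c))  # reports the differing seed
print(result_a == result_b)          # same seed and inputs give the same score
```

Comparing records this way only surfaces differences in the factors that were recorded. Anything left out of the record, such as library versions or hardware, would have to be added to the schema before a difference in it could be detected.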
