NE | AI | LG | Mar 20, 2016

Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science

arXiv:1603.06212v1 | 607 citations | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the need for accessible automation tools in data science, representing an incremental step toward fully automating pipeline design.

The paper tackles the problem of automating machine learning pipeline design for non-experts by introducing TPOT, a tree-based pipeline optimization tool, which demonstrates significant improvement over basic analyses on benchmark datasets with little user input.

As the field of data science continues to grow, there will be an ever-increasing demand for tools that make machine learning accessible to non-experts. In this paper, we introduce the concept of tree-based pipeline optimization for automating one of the most tedious parts of machine learning: pipeline design. We implement an open source Tree-based Pipeline Optimization Tool (TPOT) in Python and demonstrate its effectiveness on a series of simulated and real-world benchmark data sets. In particular, we show that TPOT can design machine learning pipelines that provide a significant improvement over a basic machine learning analysis while requiring little to no input or prior knowledge from the user. We also address the tendency for TPOT to design overly complex pipelines by integrating Pareto optimization, which produces compact pipelines without sacrificing classification accuracy. As such, this work represents an important step toward fully automating machine learning pipeline design.
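The Pareto optimization idea mentioned in the abstract can be illustrated with a small sketch. This is not TPOT's code: the `Candidate` type, the `dominates` and `pareto_front` helpers, and all scores are hypothetical, chosen only to show how trading off accuracy against pipeline size keeps compact pipelines in play without discarding the most accurate ones.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str          # hypothetical label for a candidate pipeline
    accuracy: float    # objective 1: higher is better
    n_operators: int   # objective 2 (pipeline size): lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both objectives and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.n_operators <= b.n_operators
    strictly_better = a.accuracy > b.accuracy or a.n_operators < b.n_operators
    return no_worse and strictly_better


def pareto_front(candidates: list[Candidate]) -> list[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Made-up scores, for illustration only (not results from the paper).
population = [
    Candidate("A", 0.90, 2),
    Candidate("B", 0.90, 7),  # same accuracy as A but larger: dominated by A
    Candidate("C", 0.93, 5),
    Candidate("D", 0.85, 1),
    Candidate("E", 0.88, 4),  # less accurate and larger than A: dominated by A
]

front = pareto_front(population)
print([c.name for c in front])
```

In this toy population the oversized pipeline B is filtered out because A reaches the same accuracy with fewer operators, while the most accurate pipeline C stays on the front.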

Code Implementations: 3 repos
