LG · HC · SE · Dec 14, 2020

Enabling Collaborative Data Science Development with the Ballet Framework

arXiv:2012.07816v5 · 1 citation · Has Code
AI Analysis

This work aims to enable larger-scale, open-source-style collaboration for data science projects, particularly for teams struggling with feature engineering and integration.

This paper addresses the challenge of scaling collaboration in data science projects, which are typically developed by individuals or small teams. The authors introduce Ballet, a lightweight framework and cloud-based development environment focused on collaborative feature engineering. In a case study, 27 collaborators incrementally proposed features that were merged into an ML pipeline for an income prediction problem.

While the open-source software development model has led to successful large-scale collaborations in building software systems, data science projects are frequently developed by individuals or small teams. We describe challenges to scaling data science collaborations and present a conceptual framework and ML programming model to address them. We instantiate these ideas in Ballet, a lightweight framework for collaborative, open-source data science through a focus on feature engineering, and an accompanying cloud-based development environment. Using our framework, collaborators incrementally propose feature definitions to a repository which are each subjected to an ML performance evaluation and can be automatically merged into an executable feature engineering pipeline. We leverage Ballet to conduct a case study analysis of an income prediction problem with 27 collaborators, and discuss implications for future designers of collaborative projects.
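The propose, evaluate, and merge loop described in the abstract can be illustrated with a minimal sketch. This is not Ballet's API: `FeatureDefinition`, `FeaturePipeline`, `propose`, and the `threshold` parameter are hypothetical names, the toy data is invented for illustration, and absolute Pearson correlation with the target stands in for the ML performance evaluation, whose details the abstract does not specify.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List

Row = Dict[str, float]


@dataclass
class FeatureDefinition:
    """Hypothetical stand-in for a collaborator's proposed feature."""
    name: str
    transform: Callable[[Row], float]


def abs_correlation(xs: List[float], ys: List[float]) -> float:
    """Absolute Pearson correlation; returns 0.0 if either side is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov) / (vx * vy) ** 0.5


class FeaturePipeline:
    """Accumulates accepted features into one executable pipeline."""

    def __init__(self, threshold: float = 0.3) -> None:
        self.threshold = threshold  # assumed acceptance cutoff
        self.accepted: List[FeatureDefinition] = []

    def propose(self, feature: FeatureDefinition,
                rows: List[Row], target: List[float]) -> bool:
        """Evaluate a proposed feature and merge it if it scores well enough."""
        values = [feature.transform(row) for row in rows]
        if abs_correlation(values, target) >= self.threshold:
            self.accepted.append(feature)
            return True
        return False

    def transform(self, rows: List[Row]) -> List[List[float]]:
        """Apply every accepted feature to every row."""
        return [[f.transform(row) for f in self.accepted] for row in rows]


# Toy data, invented for illustration only.
rows = [
    {"age": 25, "hours": 20},
    {"age": 35, "hours": 40},
    {"age": 45, "hours": 40},
    {"age": 55, "hours": 60},
]
income = [20, 45, 50, 80]

pipeline = FeaturePipeline()
hours_merged = pipeline.propose(
    FeatureDefinition("hours_worked", lambda r: r["hours"]), rows, income)
constant_merged = pipeline.propose(
    FeatureDefinition("constant", lambda r: 1.0), rows, income)
matrix = pipeline.transform(rows)
```

In this sketch the informative feature is merged and the constant one is rejected, so each proposal is judged on its own and the pipeline grows incrementally. A real evaluation would more plausibly use held-out model performance and a redundancy check against features already accepted.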

Code Implementations: 3 repos