DB · LG · Apr 21, 2024

Reproducible data science over data lakes: replayable data pipelines with Bauplan and Nessie

arXiv:2404.13682v1 · 9 citations · h-index: 11 · Has Code · DEEM@SIGMOD
Originality: Incremental advance
AI Analysis

This paper addresses reproducibility problems faced by data engineers working with large-scale data lakes, though the contribution appears incremental because it builds on existing tools such as Nessie.

The paper tackles the challenge of ensuring reproducible data workloads in Lakehouse architectures by introducing a system that decouples compute from data management using Bauplan's cloud runtime and Nessie's Git-like catalog. It demonstrates capabilities like time-travel and branching on object storage, enabling full pipeline reproducibility with simple CLI commands.

As the Lakehouse architecture becomes more widespread, ensuring the reproducibility of data workloads over data lakes emerges as a crucial concern for data engineers. However, achieving reproducibility remains challenging. The size of data pipelines contributes to slow testing and iterations, while the intertwining of business logic and data management complicates debugging and increases error susceptibility. In this paper, we highlight recent advancements made at Bauplan in addressing this challenge. We introduce a system designed to decouple compute from data management, by leveraging a cloud runtime alongside Nessie, an open-source catalog with Git semantics. Demonstrating the system's capabilities, we showcase its ability to offer time-travel and branching semantics on top of object storage, and offer full pipeline reproducibility with a few CLI commands.
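The branching and time-travel semantics mentioned in the abstract can be illustrated with a toy, in-memory model of a Git-like catalog. This is only a sketch under stated assumptions: `ToyCatalog`, `Commit`, and the `s3://lake/...` paths are hypothetical names invented for illustration, and none of this is the Nessie or Bauplan API. The idea it shows is that a commit is an immutable mapping from table names to snapshot files on object storage, a branch is a movable pointer to a commit, and time travel is reading through an older commit.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Commit:
    """Immutable catalog state: table name -> snapshot location (hypothetical paths)."""
    tables: Dict[str, str]
    parent: Optional[str]
    message: str

    @property
    def digest(self) -> str:
        # Content-addressed id derived from the commit's own fields.
        payload = json.dumps(
            {"tables": self.tables, "parent": self.parent, "message": self.message},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode()).hexdigest()[:12]


class ToyCatalog:
    """Toy model of a catalog with Git semantics; not a real catalog client."""

    def __init__(self) -> None:
        root = Commit(tables={}, parent=None, message="init")
        self.commits: Dict[str, Commit] = {root.digest: root}
        self.branches: Dict[str, str] = {"main": root.digest}

    def resolve(self, ref: str) -> str:
        # A ref is either a branch name or a commit digest.
        return self.branches.get(ref, ref)

    def create_branch(self, name: str, from_ref: str = "main") -> None:
        # Branching copies a pointer, never the underlying data files.
        self.branches[name] = self.resolve(from_ref)

    def commit(self, branch: str, table: str, snapshot: str, message: str) -> str:
        head = self.commits[self.branches[branch]]
        new = Commit({**head.tables, table: snapshot}, head.digest, message)
        self.commits[new.digest] = new
        self.branches[branch] = new.digest
        return new.digest

    def read(self, ref: str, table: str) -> str:
        # Time travel: any commit digest can be read, not only branch heads.
        return self.commits[self.resolve(ref)].tables[table]


catalog = ToyCatalog()
v1 = catalog.commit("main", "orders", "s3://lake/orders/snap-1.parquet", "load orders")
catalog.create_branch("dev")
v2 = catalog.commit("dev", "orders", "s3://lake/orders/snap-2.parquet", "fix orders")

main_view = catalog.read("main", "orders")  # main is unaffected by work on dev
dev_view = catalog.read("dev", "orders")    # dev sees the new snapshot
past_view = catalog.read(v1, "orders")      # reading by commit digest
```

Because a pipeline run in this model only needs a commit digest to see the exact inputs it originally read, re-running the same code against the same digest is what makes the run replayable. A real catalog would persist commits and coordinate concurrent writers, which this sketch omits.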

