Reproducible data science over data lakes: replayable data pipelines with Bauplan and Nessie
This paper addresses reproducibility for data engineers working with large-scale data lakes, though the contribution appears incremental because it builds on existing tools such as Nessie.
The paper tackles the challenge of ensuring reproducible data workloads in Lakehouse architectures by introducing a system that decouples compute from data management using Bauplan's cloud runtime and Nessie's Git-like catalog. It demonstrates capabilities like time-travel and branching on object storage, enabling full pipeline reproducibility with simple CLI commands.
As the Lakehouse architecture becomes more widespread, ensuring the reproducibility of data workloads over data lakes emerges as a crucial concern for data engineers. However, achieving reproducibility remains challenging. The size of data pipelines slows testing and iteration, while the intertwining of business logic and data management complicates debugging and makes errors more likely. In this paper, we highlight recent advancements made at Bauplan in addressing this challenge. We introduce a system designed to decouple compute from data management by leveraging a cloud runtime alongside Nessie, an open-source catalog with Git semantics. We demonstrate the system's ability to provide time-travel and branching semantics on top of object storage, and to reproduce full pipelines with a few CLI commands.
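To make the idea of a catalog with Git semantics concrete, here is a minimal in-memory sketch of branching and time-travel over immutable snapshots. It is a toy model written for illustration only: it is not Nessie's or Bauplan's API, and the names `ToyCatalog`, `Commit`, the table `orders`, and the `s3://example-bucket/...` paths are all hypothetical. The assumption is that each commit maps table names to immutable files in object storage, so a branch is just a movable pointer to a commit.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Commit:
    """One catalog state: table name -> immutable snapshot location."""
    tables: Dict[str, str]
    parent: Optional[str]
    message: str

    @property
    def commit_id(self) -> str:
        # Content-addressed id derived from the commit's own fields.
        payload = json.dumps(
            {"tables": self.tables, "parent": self.parent, "message": self.message},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode()).hexdigest()[:12]


class ToyCatalog:
    """Hypothetical Git-like catalog: branches are pointers to commits."""

    def __init__(self) -> None:
        root = Commit(tables={}, parent=None, message="init")
        self.commits: Dict[str, Commit] = {root.commit_id: root}
        self.branches: Dict[str, str] = {"main": root.commit_id}

    def resolve(self, ref: str) -> str:
        # A ref is either a branch name or a commit id.
        if ref in self.branches:
            return self.branches[ref]
        if ref in self.commits:
            return ref
        raise KeyError(f"unknown ref: {ref}")

    def create_branch(self, name: str, from_ref: str = "main") -> None:
        # Branching copies a pointer, not the underlying data.
        self.branches[name] = self.resolve(from_ref)

    def commit(self, branch: str, table: str, snapshot: str, message: str) -> str:
        head = self.commits[self.branches[branch]]
        new = Commit(
            tables={**head.tables, table: snapshot},
            parent=head.commit_id,
            message=message,
        )
        self.commits[new.commit_id] = new
        self.branches[branch] = new.commit_id
        return new.commit_id

    def read(self, table: str, ref: str = "main") -> str:
        # Reading at a commit id is time-travel; reading at a branch is its head.
        return self.commits[self.resolve(ref)].tables[table]


catalog = ToyCatalog()
first = catalog.commit(
    "main", "orders", "s3://example-bucket/orders/snap-001.parquet", "load orders"
)
catalog.create_branch("dev")
catalog.commit(
    "dev", "orders", "s3://example-bucket/orders/snap-002.parquet", "fix orders"
)
catalog.commit(
    "main", "orders", "s3://example-bucket/orders/snap-003.parquet", "reload orders"
)

print(catalog.read("orders", "main"))   # head of main
print(catalog.read("orders", "dev"))    # isolated change on the dev branch
print(catalog.read("orders", first))    # time-travel to the first commit
```

Because snapshots are never overwritten, re-running a pipeline against a recorded commit id sees exactly the data the original run saw, which is the property that replayable pipelines rely on in this sketch.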