Building a Correct-by-Design Lakehouse. Data Contracts, Versioning, and Transactional Pipelines for Humans and Agents

Weiming Sheng, Jinlang Wang, Manuel Barros, Aldrin Montana, Jacopo Tagliabue, Luca Bigon

arXiv:2602.02335v14.33 citations · h-index: 11

Originality: Incremental advance

AI Analysis

This work addresses safety issues in lakehouses used for analytics and AI, aiming to make production data handling by humans and agents more reliable. The contribution appears incremental: it applies software engineering principles to an existing platform.

The paper tackles unsafe concurrent operations in lakehouses, where upstream-downstream mismatches surface only at runtime and multi-table pipelines can leak partial effects. It presents Bauplan, a code-first lakehouse that uses typed table contracts, Git-like data versioning, and transactional runs to make pipeline boundaries checkable and to guarantee pipeline-level atomicity.
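As an illustration of what a checkable pipeline boundary could look like, here is a minimal Python sketch. It is not Bauplan's API: the `Contract` class, the `check_boundary` function, and the column and type names are all hypothetical, chosen only to show how a mismatch between an upstream output and a downstream expectation can be caught before any data is processed.

```python
from dataclasses import dataclass


@dataclass
class Contract:
    """Hypothetical typed table contract: maps column name to a type name."""
    columns: dict[str, str]


def check_boundary(produced: Contract, expected: Contract) -> list[str]:
    """Compare what an upstream step produces with what a downstream step expects.

    Returns a list of human-readable mismatches; an empty list means the
    boundary is compatible. Extra upstream columns are tolerated.
    """
    errors = []
    for name, expected_type in expected.columns.items():
        if name not in produced.columns:
            errors.append(f"missing column: {name}")
        elif produced.columns[name] != expected_type:
            errors.append(
                f"type mismatch for {name}: "
                f"expected {expected_type}, got {produced.columns[name]}"
            )
    return errors


# Usage: the check runs on schemas alone, before the pipeline touches any rows.
upstream = Contract({"user_id": "int64", "amount": "float64"})
downstream = Contract({"user_id": "int64", "amount": "decimal", "ts": "timestamp"})
errors = check_boundary(upstream, downstream)
```

Because the check only needs the declared schemas, it can run at planning time, which is the sense in which a mismatch stops being a runtime surprise.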

Lakehouses are the default cloud platform for analytics and AI, but they become unsafe when untrusted actors concurrently operate on production data: upstream-downstream mismatches surface only at runtime, and multi-table pipelines can leak partial effects. Inspired by software engineering, we design Bauplan, a code-first lakehouse that aims to make (most) illegal states unrepresentable using familiar abstractions. Bauplan acts along three axes: typed table contracts to make pipeline boundaries checkable, Git-like data versioning for review and reproducibility, and transactional runs that guarantee pipeline-level atomicity. We report early results from a lightweight formal transaction model and discuss future work motivated by counterexamples.
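The combination of Git-like versioning and transactional runs can be sketched with a toy in-memory catalog. This is an assumption-laden illustration, not Bauplan's implementation: `Catalog`, `transactional_run`, the branch naming scheme, and the table names are hypothetical, and the merge ignores concurrent changes to the target branch. The idea shown is that every step of a multi-table pipeline writes to a temporary branch, which is merged into `main` only if all steps succeed.

```python
import copy


class Catalog:
    """Toy catalog (hypothetical): each branch maps table names to their contents."""

    def __init__(self):
        self.branches = {"main": {}}

    def create_branch(self, name, from_branch="main"):
        # A branch starts as an isolated copy of its parent.
        self.branches[name] = copy.deepcopy(self.branches[from_branch])

    def merge(self, source, into="main"):
        # One reference swap: readers of `into` see all new tables or none.
        # Simplification: concurrent updates to `into` are not reconciled.
        self.branches[into] = self.branches.pop(source)

    def delete_branch(self, name):
        self.branches.pop(name, None)


def transactional_run(catalog, steps, run_id):
    """Run all steps on a temporary branch; publish only if every step succeeds."""
    branch = f"tmp-{run_id}"
    catalog.create_branch(branch)
    try:
        for step in steps:
            step(catalog.branches[branch])
    except Exception:
        catalog.delete_branch(branch)  # discard partial effects
        return False
    catalog.merge(branch)
    return True


def write_orders(tables):
    tables["orders"] = [10, 20, 30]


def write_summary(tables):
    tables["summary"] = {"count": len(tables["orders"])}


def failing_step(tables):
    raise ValueError("simulated failure after the first table was written")


catalog = Catalog()
failed = transactional_run(catalog, [write_orders, failing_step], run_id="1")
main_after_failure = copy.deepcopy(catalog.branches["main"])
succeeded = transactional_run(catalog, [write_orders, write_summary], run_id="2")
```

In the failed run, `orders` was written on the temporary branch but never reached `main`, so downstream readers never observe a state where one table is updated and the other is not.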
