Luca Bigon

h-index: 3

5 papers

48 citations

Novelty: 45%

AI Score: 36

Ranked #98,483 of 194,257 authors (top 51%); #1,013 in IR (top 47%)

5 Papers

12.4 · DB · Jul 14

Not Your Usual Type(s): Data contracts as types across languages and engines

Aldrin Montana, Colin Marc, Luca Bigon et al.

Composable data systems promise to let developers combine languages, engines, and catalogs without sacrificing a coherent user experience. In practice, however, pipeline-node boundaries remain weakly specified: transformations exchange tables through schemas that are often checked late, enforced unevenly across languages, and disconnected from the semantics business users care about. Based on over a year of operating millions of jobs in Bauplan, we share the design principles behind our new SDK, which treats data contracts as types for a composable, multi-language lakehouse. Users, whether humans or agents, annotate input and output tables with schema objects that encode column types, constraints, documentation, and lineage; Bauplan then interprets these annotations at different points in the execution lifecycle. We show how this design addresses common production failures, and how an "everything-as-code" philosophy enables both deterministic and non-deterministic reasoning over data flows across languages and engines.

14.8 · DB · Jul 9

GitLake: Git-for-data for the agentic lakehouse

Weiming Sheng, Jinlang Wang, Manuel Barros et al.

We present GitLake, a Git-for-data design for an agent-first lakehouse. The system lifts single-table Iceberg snapshots into lakehouse-wide commits, branches, and merges, letting agents work on isolated branches while humans review and publish changes. Pipelines run on temporary branches and publish through a final merge, so all outputs become visible atomically or none do. Finally, we report production lessons as well as correctness insights from a preliminary Alloy model of our core abstractions.

4.3 · DC · Feb 2

Building a Correct-by-Design Lakehouse. Data Contracts, Versioning, and Transactional Pipelines for Humans and Agents

Weiming Sheng, Jinlang Wang, Manuel Barros et al.

Lakehouses are the default cloud platform for analytics and AI, but they become unsafe when untrusted actors concurrently operate on production data: upstream-downstream mismatches surface only at runtime, and multi-table pipelines can leak partial effects. Inspired by software engineering, we design Bauplan, a code-first lakehouse that aims to make (most) illegal states unrepresentable using familiar abstractions. Bauplan acts along three axes: typed table contracts to make pipeline boundaries checkable, Git-like data versioning for review and reproducibility, and transactional runs that guarantee pipeline-level atomicity. We report early results from a lightweight formal transaction model and discuss future work motivated by counterexamples.

14.9 · IR · Jul 20, 2020

Fantastic Embeddings and How to Align Them: Zero-Shot Inference in a Multi-Shop Scenario

Federico Bianchi, Jacopo Tagliabue, Bingqing Yu et al.

This paper addresses the challenge of leveraging multiple embedding spaces for multi-shop personalization, proving that zero-shot inference is possible by transferring shopping intent from one website to another without manual intervention. We detail a machine learning pipeline to train and optimize embeddings within shops first, and support the quantitative findings with additional qualitative insights. We then turn to the harder task of using learned embeddings across shops: if products from different shops live in the same vector space, user intent - as represented by regions in this space - can then be transferred in a zero-shot fashion across websites. We propose and benchmark unsupervised and supervised methods to "travel" between embedding spaces, each with its own assumptions on data quantity and quality. We show that zero-shot personalization is indeed possible at scale by testing the shared embedding space with two downstream tasks, event prediction and type-ahead suggestions. Finally, we curate a cross-shop anonymized embeddings dataset to foster an inclusive discussion of this important business scenario.

9.2 · IR · Jun 30, 2019

Prediction is very hard, especially about conversion. Predicting user purchases from clickstream data in fashion e-commerce

Luca Bigon, Giovanni Cassani, Ciro Greco et al.

Knowing whether a user is a buyer or a window shopper solely based on clickstream data is of crucial importance for e-commerce platforms seeking to implement real-time, accurate NBA (next best action) policies. However, due to the low frequency of conversion events and the noisiness of browsing data, classifying user sessions is very challenging. In this paper, we address the clickstream classification problem in the fashion industry and present three major contributions to the burgeoning field of AI in fashion: first, we collect, normalize, and prepare a novel dataset of live shopping sessions from a major European e-commerce fashion website; second, we use the dataset to test, in a controlled environment, strong baselines and SOTA models from the literature; finally, we propose a new discriminative neural model that outperforms neural architectures recently proposed at Rakuten labs.