DB, LG, SE | Jul 10, 2024

Instrumentation and Analysis of Native ML Pipelines via Logical Query Plans

arXiv:2407.07560v1 | h-index: 9
Originality: Incremental advance
AI Analysis

This work addresses the need for data scientists to ensure correctness and reliability in ML pipelines. It is an incremental advance that builds on existing library-based approaches.

The paper tackles the problem of automating the analysis and instrumentation of machine learning pipelines. It extracts logical query plans from pipeline code, which enables automatic provenance tracking and what-if analyses without manual annotation. The result is a system that efficiently instruments static ML pipelines to screen for data issues and supports more advanced analyses through automated rewriting.

Machine Learning (ML) is increasingly used to automate impactful decisions, which leads to concerns regarding their correctness, reliability, and fairness. We envision highly-automated software platforms to assist data scientists with developing, validating, monitoring, and analysing their ML pipelines. In contrast to existing work, our key idea is to extract "logical query plans" from ML pipeline code relying on popular libraries. Based on these plans, we automatically infer pipeline semantics and instrument and rewrite the ML pipelines to enable diverse use cases without requiring data scientists to manually annotate or rewrite their code. First, we developed such an abstract ML pipeline representation together with machinery to extract it from Python code. Next, we used this representation to efficiently instrument static ML pipelines and apply provenance tracking, which enables lightweight screening for common data preparation issues. Finally, we built machinery to automatically rewrite ML pipelines to perform more advanced what-if analyses and proposed using multi-query optimisation for the resulting workloads. In future work, we aim to interactively assist data scientists as they work on their ML pipelines.
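To make the idea of a "logical query plan" concrete, the following is a minimal sketch, not the authors' extraction machinery. It assumes a purely static approach: the pipeline source is parsed with Python's standard `ast` module and a handful of well-known pandas and scikit-learn method names are mapped to operator kinds. The mapping `OPERATOR_KINDS`, the function `extract_plan`, and the example pipeline (including its file and column names) are hypothetical and chosen only for illustration. The pipeline text is parsed, never executed.

```python
import ast

# Hypothetical mapping from library method names to logical operator kinds.
# A real system would need a far richer, library-aware mapping.
OPERATOR_KINDS = {
    "read_csv": "DataSource",
    "merge": "Join",
    "dropna": "Selection",
    "fit": "Estimator",
}


def extract_plan(source: str) -> list[tuple[str, str]]:
    """Return (operator_kind, method_name) pairs in source order."""
    tree = ast.parse(source)
    found = []
    for node in ast.walk(tree):
        # Only attribute-style calls such as pd.read_csv(...) or df.merge(...).
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            name = node.func.attr
            if name in OPERATOR_KINDS:
                found.append((node.lineno, node.col_offset, OPERATOR_KINDS[name], name))
    # ast.walk does not guarantee source order, so sort by position.
    found.sort()
    return [(kind, name) for _, _, kind, name in found]


# Illustrative pipeline text; it is only parsed, so no files or libraries are needed.
PIPELINE = """
import pandas as pd
from sklearn.linear_model import LogisticRegression
patients = pd.read_csv("patients.csv")
histories = pd.read_csv("histories.csv")
data = patients.merge(histories, on="ssn")
data = data.dropna()
model = LogisticRegression().fit(data[["age"]], data["label"])
"""

plan = extract_plan(PIPELINE)
for kind, name in plan:
    print(f"{kind:<11} <- {name}")
```

Matching on method names alone is deliberately naive: it cannot tell a pandas `merge` from an unrelated method with the same name, and it recovers only the order of operators, not the dataflow edges between them. It is meant to show the kind of representation such a plan provides, on top of which screening and rewriting could operate.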

