SemPipes -- Optimizable Semantic Data Operators for Tabular Machine Learning Pipelines
This work addresses the substantial expertise and engineering effort required to design tabular ML pipelines. Its automation approach is novel, although it builds incrementally on existing LLM capabilities.
The paper tackles the challenge of designing complex data preparation pipelines for tabular machine learning. It introduces SemPipes, a declarative programming model whose LLM-powered semantic operators automate and optimize data transformations. Across diverse tasks, this improves predictive performance and reduces pipeline complexity.
Real-world machine learning on tabular data relies on complex data preparation pipelines for prediction, data integration, augmentation, and debugging. Designing these pipelines requires substantial domain expertise and engineering effort, motivating the question of how large language models (LLMs) can support tabular ML through code synthesis. We introduce SemPipes, a novel declarative programming model that integrates LLM-powered semantic data operators into tabular ML pipelines. Semantic operators specify data transformations in natural language while delegating execution to a runtime system. During training, SemPipes synthesizes custom operator implementations based on data characteristics, operator instructions, and pipeline context. This design enables the automatic optimization of data operations in a pipeline via LLM-based code synthesis guided by evolutionary search. We evaluate SemPipes across diverse tabular ML tasks and show that semantic operators substantially improve end-to-end predictive performance for both expert-designed and agent-generated pipelines, while reducing pipeline complexity. We implement SemPipes in Python and release it at https://github.com/deem-data/sempipes/tree/v1.
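The core idea can be illustrated with a toy sketch: an operator carries a natural-language instruction, and an evolutionary search over candidate implementations picks the one that scores best on the data. This is not the SemPipes API. The names `SemanticOperator`, `compile_candidate`, `score`, and `mutate` are hypothetical, the random `mutate` step merely stands in for an LLM proposing revised code, and the score is a simplified proxy for end-to-end pipeline performance.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, float]
Candidate = Tuple[str, str, str]  # (left column, operator symbol, right column)

OPS: Dict[str, Callable[[float, float], float]] = {
    "+": lambda x, y: x + y,
    "-": lambda x, y: x - y,
    "*": lambda x, y: x * y,
}
COLUMNS = ["a", "b", "c"]


def compile_candidate(cand: Candidate) -> Callable[[Row], float]:
    """Turn a candidate description into an executable row transformation."""
    left, op, right = cand
    return lambda row: OPS[op](row[left], row[right])


def score(cand: Candidate, rows: List[Row], target: List[float]) -> float:
    """Toy proxy for pipeline quality: negative MSE of the feature vs. the target."""
    fn = compile_candidate(cand)
    return -sum((fn(r) - t) ** 2 for r, t in zip(rows, target)) / len(rows)


def mutate(cand: Candidate, rng: random.Random) -> Candidate:
    """Assumption: stands in for an LLM revising code; changes one component."""
    left, op, right = cand
    slot = rng.randrange(3)
    if slot == 0:
        left = rng.choice(COLUMNS)
    elif slot == 1:
        op = rng.choice(list(OPS))
    else:
        right = rng.choice(COLUMNS)
    return (left, op, right)


@dataclass
class SemanticOperator:
    """Hypothetical operator: an instruction plus a synthesized implementation."""
    instruction: str  # unused by this toy search; a real system would prompt an LLM with it
    best: Optional[Candidate] = None

    def fit(self, rows: List[Row], target: List[float],
            generations: int = 20, population: int = 6, seed: int = 0) -> float:
        rng = random.Random(seed)
        pop = [(rng.choice(COLUMNS), rng.choice(list(OPS)), rng.choice(COLUMNS))
               for _ in range(population)]
        for _ in range(generations):
            # Keep the better half (elitism), then refill with mutated survivors.
            pop.sort(key=lambda c: score(c, rows, target), reverse=True)
            survivors = pop[: population // 2]
            pop = survivors + [mutate(rng.choice(survivors), rng) for _ in survivors]
        self.best = max(pop, key=lambda c: score(c, rows, target))
        return score(self.best, rows, target)

    def transform(self, rows: List[Row]) -> List[float]:
        if self.best is None:
            raise RuntimeError("call fit() before transform()")
        fn = compile_candidate(self.best)
        return [fn(r) for r in rows]


# Synthetic data: the target is the interaction of columns a and b.
data_rng = random.Random(42)
rows = [{c: data_rng.uniform(-1, 1) for c in COLUMNS} for _ in range(50)]
target = [r["a"] * r["b"] for r in rows]

op = SemanticOperator("Derive a feature capturing the interaction of a and b")
final_score = op.fit(rows, target)
features = op.transform(rows)
print(op.best, final_score)
```

Because the best candidate always survives selection, the score cannot decrease across generations. In this sketch the search space is only 27 expressions; the point is the split between a declarative instruction, a synthesized implementation chosen during fitting, and a fixed transformation applied afterwards.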