DB | CL | Dec 9, 2024

AutoDCWorkflow: LLM-based Data Cleaning Workflow Auto-Generation and Benchmark

arXiv:2412.06724v3 | 25 citations | h-index: 27 | EMNLP
Originality: Incremental advance
AI Analysis

This work addresses the time-consuming and error-prone manual data cleaning that data analysts face. The advance appears incremental, since it builds on existing LLM capabilities and OpenRefine tooling.

The paper automates data-cleaning workflows by introducing AutoDCWorkflow, an LLM-based pipeline that generates sequences of OpenRefine operations to produce clean tables for specific analysis purposes. In the reported experiments, Llama 3.1, Mistral, and Gemma 2 improve data quality and outperform the baseline across all metrics, with Gemma 2-27B the most consistent for table and answer quality.

Data cleaning is a time-consuming and error-prone manual process, even with modern workflow tools such as OpenRefine. We present AutoDCWorkflow, an LLM-based pipeline for automatically generating data-cleaning workflows. The pipeline takes a raw table and a data analysis purpose, and generates a sequence of OpenRefine operations designed to produce a minimal, clean table sufficient to address the purpose. Six operations correspond to common data quality issues, including format inconsistencies, type errors, and duplicates. To evaluate AutoDCWorkflow, we create a benchmark with metrics assessing answers, data, and workflow quality for 142 purposes using 96 tables across six topics. The evaluation covers three key dimensions: (1) Purpose Answer: can the cleaned table produce a correct answer? (2) Column (Value): how closely does it match the ground truth table? (3) Workflow (Operations): to what extent does the generated workflow resemble the human-curated ground truth? Experiments show that Llama 3.1, Mistral, and Gemma 2 significantly enhance data quality, outperforming the baseline across all metrics. Gemma 2-27B consistently generates high-quality tables and answers, while Gemma 2-9B excels in producing workflows that closely resemble human-annotated versions.
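To make the idea of a workflow as a sequence of operations concrete, the sketch below applies an ordered list of column-level cleaning steps to a small table and then scores one column against a ground-truth table. It is a minimal illustration under stated assumptions, not the paper's implementation: the operation names (`trim`, `upper`, `to_number`, `dedupe`), the `Operation` class, and the `column_match_ratio` score are hypothetical stand-ins, and they are neither OpenRefine's operation format nor the benchmark's actual metrics. The LLM step that would choose the operations from the analysis purpose is omitted.

```python
from dataclasses import dataclass

Row = dict[str, object]
Table = list[Row]


@dataclass(frozen=True)
class Operation:
    """One workflow step: a hypothetical operation name applied to one column."""
    name: str
    column: str


def _trim(table: Table, column: str) -> Table:
    # Strip leading and trailing whitespace from every cell in the column.
    return [{**row, column: str(row[column]).strip()} for row in table]


def _upper(table: Table, column: str) -> Table:
    # Normalize letter case so that "chicago" and "Chicago" compare equal.
    return [{**row, column: str(row[column]).upper()} for row in table]


def _to_number(table: Table, column: str) -> Table:
    # Convert text such as "1,200" to a float; unparseable cells become None.
    cleaned = []
    for row in table:
        text = str(row[column]).replace(",", "").strip()
        try:
            value = float(text)
        except ValueError:
            value = None
        cleaned.append({**row, column: value})
    return cleaned


def _dedupe(table: Table, column: str) -> Table:
    # Keep only the first row for each distinct value in the column.
    seen = set()
    kept = []
    for row in table:
        if row[column] not in seen:
            seen.add(row[column])
            kept.append(row)
    return kept


# Hypothetical operation registry (illustrative names, not OpenRefine's).
HANDLERS = {
    "trim": _trim,
    "upper": _upper,
    "to_number": _to_number,
    "dedupe": _dedupe,
}


def apply_workflow(table: Table, workflow: list[Operation]) -> Table:
    """Apply each operation in order; the input table is not mutated."""
    for op in workflow:
        table = HANDLERS[op.name](table, op.column)
    return table


def column_match_ratio(cleaned: Table, truth: Table, column: str) -> float:
    """Fraction of cells in `column` equal to the ground truth (illustrative score)."""
    if not truth or len(cleaned) != len(truth):
        return 0.0
    hits = sum(1 for c, t in zip(cleaned, truth) if c[column] == t[column])
    return hits / len(truth)


raw = [
    {"city": " chicago ", "amount": "1,200"},
    {"city": "Chicago", "amount": "1,200"},
    {"city": "boston", "amount": "n/a"},
]
workflow = [
    Operation("trim", "city"),
    Operation("upper", "city"),
    Operation("to_number", "amount"),
    Operation("dedupe", "city"),
]
cleaned = apply_workflow(raw, workflow)
```

Order matters in this sketch: deduplicating before trimming and upper-casing would keep both Chicago rows, which is one reason a workflow is naturally modeled as an ordered sequence rather than a set of fixes.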

