LG · AI · DB · SE · Feb 5, 2025

Data Wrangling Task Automation Using Code-Generating Language Models

arXiv:2502.15732v1 · 2 citations · h-index: 4 · AAAI
Originality: Synthesis-oriented
AI Analysis

This paper addresses data wrangling automation for users working with large datasets, though it appears incremental because it applies existing LLMs to a specific domain.

The authors tackled the challenge of ensuring data quality in large tabular datasets by developing an automated system that uses large language models to generate executable code for tasks like missing value imputation and error correction, achieving effective handling of both memory-dependent and memory-independent tasks.

Ensuring data quality in large tabular datasets is a critical challenge, typically addressed through data wrangling tasks. Traditional statistical methods, though efficient, often fail to capture semantic context, while deep learning approaches are resource-intensive and require task- and dataset-specific training. To overcome these shortcomings, we present an automated system that uses large language models to generate executable code for tasks such as missing value imputation, error detection, and error correction. Our system aims to identify inherent patterns in the data while leveraging external knowledge, effectively addressing both memory-dependent and memory-independent tasks.
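
To make the idea of "generated executable code" concrete, here is a minimal sketch of the kind of imputation routine a code-generating model might plausibly emit. It is not the authors' system or its output: the function name `impute_from_dependency`, the column names, and the sample rows are all hypothetical, and the technique shown (filling a missing value from the most common value observed for the same key in other rows) is just one simple instance of exploiting a pattern inherent in the data.

```python
from collections import Counter, defaultdict


def impute_from_dependency(rows, key_col, target_col):
    """Fill missing target_col values with the most common value
    observed for the same key_col value elsewhere in the table.

    Hypothetical helper for illustration; rows is a list of dicts.
    """
    missing = (None, "")

    # First pass: count which target values co-occur with each key.
    votes = defaultdict(Counter)
    for row in rows:
        if row.get(target_col) not in missing:
            votes[row.get(key_col)][row[target_col]] += 1

    # Second pass: copy each row and fill gaps where evidence exists.
    filled = []
    for row in rows:
        new_row = dict(row)
        candidates = votes.get(new_row.get(key_col))
        if new_row.get(target_col) in missing and candidates:
            new_row[target_col] = candidates.most_common(1)[0][0]
        filled.append(new_row)
    return filled


# Illustrative data (assumed, not from the paper).
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": None},
    {"zip": "94103", "city": "San Francisco"},
    {"zip": "60601", "city": None},
]
result = impute_from_dependency(rows, "zip", "city")
```

In this sketch the second row is filled from the first because they share a key, while the last row stays empty because the table itself holds no evidence for it. A case like that would presumably need the external knowledge the abstract mentions, which this example does not attempt to model.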
