LG, AI, DB | Jul 14, 2023

DataAssist: A Machine Learning Approach to Data Cleaning and Preparation

arXiv:2307.07119v2 | 20 citations | h-index: 3
AI Analysis

This work addresses the data preparation bottleneck for users in fields such as economics and business. It offers a data-centric tool whose contribution is incremental, building on existing autoML concepts.

The paper tackles data cleaning and preparation, which consume most of the time in data analysis, by introducing DataAssist, an automated platform that the authors report saves over 50% of the time spent on these tasks.

Current automated machine learning (ML) tools are model-centric, focusing on model selection and parameter optimization. However, the majority of the time in data analysis is devoted to data cleaning and wrangling, for which limited tools are available. Here we present DataAssist, an automated data preparation and cleaning platform that enhances dataset quality using ML-informed methods. We show that DataAssist provides a pipeline for exploratory data analysis and data cleaning, including generating visualizations for user-selected variables, unifying data annotation, suggesting anomaly removal, and preprocessing data. The exported dataset can be readily integrated with other autoML tools or user-specified models for downstream analysis. Our data-centric tool is applicable to a variety of fields, including economics, business, and forecasting applications, saving over 50% of the time spent on data cleansing and preparation.
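As a rough illustration of the kind of steps the abstract lists (unifying annotations, suggesting anomaly removal, preprocessing), the following is a minimal Python sketch using only the standard library. It is not DataAssist's implementation: the function names, the interquartile-range (IQR) rule for flagging anomalies, and median imputation are all assumptions chosen for illustration, and the sample values are invented.

```python
import statistics


def unify_labels(values):
    """Collapse case and whitespace variants of the same text label.

    Hypothetical stand-in for "unifying data annotation".
    """
    return [v.strip().lower() if isinstance(v, str) else v for v in values]


def flag_outliers_iqr(values, k=1.5):
    """Mark values outside [Q1 - k*IQR, Q3 + k*IQR] as suggested anomalies.

    Missing entries (None) are ignored when computing quartiles and are
    never flagged. The IQR rule is an assumption, not the paper's method.
    """
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v is not None and not (low <= v <= high) for v in values]


def impute_median(values):
    """Replace missing entries (None) with the median of the observed ones."""
    present = [v for v in values if v is not None]
    median = statistics.median(present)
    return [median if v is None else v for v in values]


# Invented sample column with one missing value and one extreme value.
incomes = [42.0, 39.5, None, 41.0, 40.0, 400.0, 38.5, 43.0]

# The flags are only suggestions; the user decides whether to drop the rows.
flags = flag_outliers_iqr(incomes)
kept = [v for v, is_outlier in zip(incomes, flags) if not is_outlier]
cleaned = impute_median(kept)
```

Keeping the anomaly step as a list of flags rather than deleting values directly mirrors the abstract's wording that anomaly removal is suggested, leaving the final decision to the user.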

