SE AI Mar 19, 2022

Data Smells: Categories, Causes and Consequences, and Detection of Suspicious Data in AI-based Systems

Harald Foidl, Michael Felderer, Rudolf Ramler

arXiv:2203.10384v35.944 citations h-index: 39

Originality: Incremental advance

AI Analysis

This paper addresses data quality problems relevant to AI developers and researchers. Its advance is incremental: it builds on existing data quality research by introducing a new conceptual framework.

The paper tackles the problem of latent data quality issues in AI-based systems by conceptualizing 'Data Smells' as a counterpart to code smells, presenting a catalogue of 36 smells across three categories and detecting them in over 240 real-world datasets.

High data quality is fundamental for today's AI-based systems. However, although data quality has been an object of research for decades, there is a clear lack of research on potential data quality issues (e.g., ambiguous, extraneous values). These kinds of issues are latent in nature and thus often not obvious. Nevertheless, they can be associated with an increased risk of future problems in AI-based systems (e.g., technical debt, data-induced faults). As a counterpart to code smells in software engineering, we refer to such issues as Data Smells. This article conceptualizes data smells and elaborates on their causes, consequences, detection, and use in the context of AI-based systems. In addition, a catalogue of 36 data smells divided into three categories (i.e., Believability Smells, Understandability Smells, Consistency Smells) is presented. Moreover, the article outlines tool support for detecting data smells and presents the result of an initial smell detection on more than 240 real-world datasets.
