Harald Foidl

2 Papers

SE · Mar 19, 2022
Data Smells: Categories, Causes and Consequences, and Detection of Suspicious Data in AI-based Systems

Harald Foidl, Michael Felderer, Rudolf Ramler

High data quality is fundamental for today's AI-based systems. However, although data quality has been an object of research for decades, there is a clear lack of research on potential data quality issues (e.g., ambiguous, extraneous values). These kinds of issues are latent in nature and thus often not obvious. Nevertheless, they can be associated with an increased risk of future problems in AI-based systems (e.g., technical debt, data-induced faults). As a counterpart to code smells in software engineering, we refer to such issues as Data Smells. This article conceptualizes data smells and elaborates on their causes, consequences, detection, and use in the context of AI-based systems. In addition, a catalogue of 36 data smells divided into three categories (i.e., Believability Smells, Understandability Smells, Consistency Smells) is presented. Moreover, the article outlines tool support for detecting data smells and presents the result of an initial smell detection on more than 240 real-world datasets.

SE · May 31, 2019
Technical Debt in Data-Intensive Software Systems

Harald Foidl, Michael Felderer, Stefan Biffl

The ever-increasing amount and variety of today's data, as well as the speed at which it is generated and processed, pose a range of new challenges for developing Data-Intensive Software Systems (DISS). As with developing other kinds of software systems, developing DISS is often done under severe pressure and strict schedules. Thus, developers of DISS often have to make technical compromises to meet business concerns. This position paper proposes a conceptual model that outlines where Technical Debt (TD) can emerge and proliferate within such data-centric systems by separating a DISS into three parts (Software Systems, Data Storage Systems and Data). Further, the paper illustrates the proliferation of Database Schema Smells as TD items within a relational database-centric software system based on two examples.