Dmytro Humeniuk

10.7 | LG | Oct 3, 2023 | Code

Data Cleaning and Machine Learning: A Systematic Literature Review

Pierre-Olivier Côté, Amin Nikanjam, Nafisa Ahmed et al.

Context: Machine Learning (ML) is integrated into a growing number of systems for various applications. Because the performance of an ML model is highly dependent on the quality of the data it has been trained on, there is a growing interest in approaches to detect and repair data errors (i.e., data cleaning). Researchers are also exploring how ML can be used for data cleaning, thereby creating a dual relationship between ML and data cleaning. To the best of our knowledge, there is no study that comprehensively reviews this relationship. Objective: This paper's objectives are twofold. First, it aims to summarize the latest approaches for data cleaning for ML and ML for data cleaning. Second, it provides future work recommendations. Method: We conduct a systematic literature review of the papers published between 2016 and 2022 inclusive. We identify different types of data cleaning activities with and for ML: feature cleaning, label cleaning, entity matching, outlier detection, imputation, and holistic data cleaning. Results: We summarize the content of 101 papers covering various data cleaning activities and provide 24 future work recommendations. Our review highlights many promising data cleaning techniques that can be further extended. Conclusion: We believe that our review of the literature will help the community develop better approaches to clean data.

3.8 | CR | Feb 23, 2021

Data Driven Testing of Cyber Physical Systems

Dmytro Humeniuk, Giuliano Antoniol, Foutse Khomh

Consumer-grade cyber-physical systems (CPS) are becoming an integral part of our lives, automating and simplifying everyday tasks. However, due to complex interactions between hardware, networking, and software, developing and testing such systems is known to be a challenging task. Various quality assurance and testing strategies have been proposed. The most common approach for pre-deployment testing is to model the system and run simulations with models or software in the loop. In practice, tests are most often run for a small number of simulations, which are selected based on the engineers' domain knowledge and experience. In this paper, we propose an approach to automatically generate fault-revealing test cases for CPS. We have implemented our approach in Python using standard frameworks, and used it to generate scenarios violating temperature constraints for a smart thermostat implemented as part of our IoT testbed. Data collected from an application managing a smart building have been used to learn models of the environment under ever-changing conditions. The suggested approach allowed us to identify several pitfalls, i.e., scenarios (environment conditions and inputs) in which the system does not behave as expected.

2 Papers