DB SE AP · Mar 8, 2019

Automated data validation: an industrial experience report

Lei Zhang, Sean Howard, Tom Montpool, Jessica Moore, Krittika Mahajan, Andriy Miranskyy

arXiv:1903.03676v21.29 citations · h-index: 26 · Has Code

Originality: Synthesis-oriented

AI Analysis

This paper addresses the lack of automation tools for data validation in industrial data science, though its contribution is incremental: it adapts existing software engineering practices rather than introducing new ones.

The paper tackles the problem of automating data validation in data science by applying software engineering best practices, resulting in the development of RESTORE, an open-source R package that efficiently detects errors and reduces testing costs in industrial geodemographic data.

There has been a massive explosion of data generated by customers and retained by companies in the last decade. However, there is a significant mismatch between the increasing volume of data and the lack of automation methods and tools. The lack of best practices in data science programming may lead to software quality degradation, release schedule slippage, and budget overruns. To mitigate these concerns, we would like to bring software engineering best practices into data science. Specifically, we focus on automated data validation in the data preparation phase of the software development life cycle. This paper studies a real-world industrial case and applies software engineering best practices to develop an automated test harness called RESTORE. We release RESTORE as an open-source R package. Our experience report, conducted on geodemographic data, shows that RESTORE enables efficient and effective detection of errors injected during the data preparation phase. RESTORE also significantly reduced the cost of testing. We hope that the community benefits from the open-source project and the practical advice based on our experience.
