Detecting Quality Problems in Research Data: A Model-Driven Approach
This work addresses data quality assurance for researchers, especially in the humanities and cultural heritage, but it is incremental in that it builds on existing pattern-based methods.
The authors tackle the challenge of detecting quality problems in research data with a model-driven approach that abstracts from the database technology and uses patterns to locate issues. They implemented and evaluated the approach for XML databases in cultural heritage and report that it identifies quality problems effectively.
As scientific progress depends heavily on the quality of research data, the scientific community imposes strict requirements on data quality. A major challenge in data quality assurance is to localise quality problems that are inherent to the data. Because digitalisation is advancing rapidly in specific scientific fields, especially the humanities, different database technologies and data formats may be adopted for rather short periods in order to gain experience. We present a model-driven approach to analysing the quality of research data that abstracts from the underlying database technology. Based on the observation that many quality problems exhibit anti-patterns, a data engineer formulates analysis patterns that are generic with respect to database format and technology. A domain expert chooses a pattern that has been adapted to a specific database technology and concretises it for a domain-specific database format. Data analysts then use the resulting concrete patterns to locate quality problems in their databases. As a proof of concept, we implemented tool support that realises this approach for XML databases. We evaluated the approach with respect to expressiveness and performance in the domain of cultural heritage, based on a qualitative study of quality problems occurring in cultural heritage data.
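The three-step workflow described above (generic pattern, adaptation to a database technology, concretisation for a domain format) can be illustrated with a minimal Python sketch. It is not the authors' tool: the class names `GenericPattern`, `XmlPattern` and `ConcretePattern`, the "missing value" anti-pattern, and the sample element names `object` and `title` are all assumptions chosen for illustration, and matching is done with the standard library's `xml.etree.ElementTree` rather than a real XML database.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class GenericPattern:
    """Technology-independent anti-pattern (hypothetical), written by a data engineer."""
    name: str
    description: str


@dataclass(frozen=True)
class XmlPattern:
    """The generic pattern adapted to XML; element names are still open parameters."""
    generic: GenericPattern

    def concretise(self, record_tag: str, field_tag: str) -> "ConcretePattern":
        # A domain expert binds the open parameters to a domain-specific format.
        return ConcretePattern(self, record_tag, field_tag)


@dataclass(frozen=True)
class ConcretePattern:
    """Fully bound pattern that a data analyst can apply to a database."""
    xml_pattern: XmlPattern
    record_tag: str
    field_tag: str

    def find_matches(self, root: ET.Element) -> list:
        """Return every record whose field is absent or contains only whitespace."""
        matches = []
        for record in root.iter(self.record_tag):
            field = record.find(self.field_tag)
            if field is None or not (field.text or "").strip():
                matches.append(record)
        return matches


# Step 1: generic pattern (assumed example of an anti-pattern).
MISSING_VALUE = GenericPattern(
    name="missing-value",
    description="A record lacks a required property or the property is empty.",
)

# Step 2: adaptation to XML.
xml_missing_value = XmlPattern(MISSING_VALUE)

# Step 3: concretisation for an invented cultural heritage format.
missing_title = xml_missing_value.concretise(record_tag="object", field_tag="title")

# Invented sample data, used only to exercise the sketch.
SAMPLE = """<collection>
  <object id="1"><title>Amphora</title></object>
  <object id="2"><title>  </title></object>
  <object id="3"><creator>unknown</creator></object>
</collection>"""

root = ET.fromstring(SAMPLE)
problem_ids = [record.get("id") for record in missing_title.find_matches(root)]
print(problem_ids)  # ['2', '3']
```

Keeping the three levels as separate types mirrors the division of roles in the text: only the last step mentions domain-specific element names, so the same adapted pattern could be concretised again for a different XML format.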