29.2DBApr 10Code
Evaluating Data Quality Tools: Measurement Capabilities and LLM IntegrationTobias Rehberger, Thomas Hütter, Lisa Ehrlinger et al.
High data quality is critical for reliable analytics and operational efficiency. A growing ecosystem of tools has emerged to support data quality management, ranging from lightweight open-source libraries to comprehensive enterprise platforms. This paper evaluates six data quality tools: Great Expectations, Deequ, Evidently, Informatica, Experian, and Ataccama. The evaluation criteria cover rule definition, duplicate detection, metric aggregation, and uncertainty handling, and were derived from real-world use cases of company partners. We further examine to what extent these tools integrate Large Language Models (LLMs). Our findings show that proprietary tools offer more comprehensive measurement features and emerging LLM-based assistance, while open-source tools provide flexibility at the cost of higher implementation effort. Across all tools, LLM integration remains limited to rule creation workflows. Direct data validation through LLMs is not yet supported by any of the evaluated tools.
11.1DBApr 10
A Catalog of Data ErrorsDivya Bhadauria, Hazar Harmouch, Felix Naumann et al.
Data errors are widespread in real-world databases and severely impact downstream applications, such as machine learning pipelines or business analytics reports. Causes of such errors are manifold and can arise during both the design phase and the operational phase of a database. Some error types, such as missing values, duplicate tuples, or constraint violations, are widely recognized; others, such as disguised missing values or word transpositions, remain underexplored. Existing attempts to define and classify errors in data offer valuable but limited taxonomies, mostly informal and not covering the full range of error types. With the rise of AI, practitioners must increasingly detect and correct statistical errors such as bias and outliers, which are rarely considered within existing error taxonomies. This catalog presents a comprehensive list of 35 distinct error types, including both data errors (e.g., missing values, duplicate tuples) and error indicators (e.g., outliers, bias) for tabular data, classified into three non-overlapping categories: missing, incorrect, and redundant. For each error type, we provide a formal definition and practical example, and resolve terminological inconsistencies across related work. Our catalog enables researchers and practitioners to address various error types and systematically implement error-specific detection and cleaning strategies in data quality tools.
AIDec 16, 2020
Knowledge Graphs in Manufacturing and Production: A Systematic Literature ReviewGeorg Buchgeher, David Gabauer, Jorge Martinez-Gil et al.
Knowledge graphs in manufacturing and production aim to make production lines more efficient and flexible with higher quality output. This makes knowledge graphs attractive for companies to reach Industry 4.0 goals. However, existing research in the field is quite preliminary, and more research effort on analyzing how knowledge graphs can be applied in the field of manufacturing and production is needed. Therefore, we have conducted a systematic literature review as an attempt to characterize the state-of-the-art in this field, i.e., by identifying exiting research and by identifying gaps and opportunities for further research. To do that, we have focused on finding the primary studies in the existing literature, which were classified and analyzed according to four criteria: bibliometric key facts, research type facets, knowledge graph characteristics, and application scenarios. Besides, an evaluation of the primary studies has also been carried out to gain deeper insights in terms of methodology, empirical evidence, and relevance. As a result, we can offer a complete picture of the domain, which includes such interesting aspects as the fact that knowledge fusion is currently the main use case for knowledge graphs, that empirical research and industrial application are still missing to a large extent, that graph embeddings are not fully exploited, and that technical literature is fast-growing but seems to be still far from its peak.
DBJul 18, 2019
A Survey of Data Quality Measurement and Monitoring ToolsLisa Ehrlinger, Elisa Rusz, Wolfram Wöß
High-quality data is key to interpretable and trustworthy data analytics and the basis for meaningful data-driven decisions. In practical scenarios, data quality is typically associated with data preprocessing, profiling, and cleansing for subsequent tasks like data integration or data analytics. However, from a scientific perspective, a lot of research has been published about the measurement (i.e., the detection) of data quality issues and different generally applicable data quality dimensions and metrics have been discussed. In this work, we close the gap between research into data quality measurement and practical implementations by investigating the functional scope of current data quality tools. With a systematic search, we identified 667 software tools dedicated to "data quality", from which we evaluated 13 tools with respect to three functionality areas: (1) data profiling, (2) data quality measurement in terms of metrics, and (3) continuous data quality monitoring. We selected the evaluated tools with regard to pre-defined exclusion criteria to ensure that they are domain-independent, provide the investigated functions, and are evaluable freely or as trial. This survey aims at a comprehensive overview on state-of-the-art data quality tools and reveals potential for their functional enhancement. Additionally, the results allow a critical discussion on concepts, which are widely accepted in research, but hardly implemented in any tool observed, for example, generally applicable data quality metrics.