11.8PLMay 11
CPSLint: A Domain-Specific Language Providing Data Validation and Sanitisation for Industrial Cyber-Physical SystemsUraz Odyurt, Ömer Sayilir, Mariëlle Stoelinga et al.
Industrial cyber-physical systems generate vast amounts of semi-structured time-series data that require careful preprocessing before they can be effectively used for machine learning applications such as fault detection and identification. Raw sensor datasets are often corrupted or incomplete, making it challenging to develop reliable solutions without proper data preparation and validation. In this paper, we introduce CPSLint, a domain-specific language for data validation and sanitisation. We present the design, implementation and evaluation of CPSLint, demonstrating its ability to automatically detect and correct common data corruption patterns while enabling non-programming domain experts to effectively prepare their data for analysis. We report evaluation results on a representative dataset, tracking memory consumption and CPU-time for sanitisation activities. Our approach offers several advantages over traditional methods, including reduced manual effort, guaranteed consistency and broader applicability across time-series datasets and projects.
2.0PLApr 20
Implementing CPSLint: A Data Validation and Sanitisation Tool for Industrial Cyber-Physical SystemsUraz Odyurt, Ömer Sayilir, Mariëlle Stoelinga et al.
Raw datasets are often too large and unstructured to work with directly, and require a data preparation phase. The domain of industrial Cyber-Physical Systems (CPSs) is no exception, as raw data typically consists of large time-series data collections that log the system's status at regular time intervals. The processing of such raw data is often carried out using ad hoc, case-specific, one-off Python scripts, often neglecting aspects of readability, reusability, and maintainability. In practice, this can cause professionals such as data scientists to write similar data preparation scripts for each case, requiring them to do much repetitive work. We introduce CPSLint, a Domain-Specific Language (DSL) designed to support the data preparation process for industrial CPS. CPSLint raises the level of abstraction to the point where both data scientists and domain experts can perform the data preparation task. We leverage the fact that many raw data collections in the industrial CPS domain require similar actions to render them suitable for data-centric workflows. In our DSL one can express the data preparation process in just a few lines of code. CPSLint is a publicly available tool applicable for any case involving time-series data collections in need of sanitisation.
SEMar 31, 2017Code
Does Python Smell Like Java? Tool Support for Design Defect Discovery in PythonNicole Vavrová, Vadim Zaytsev
The context of this work is specification, detection and ultimately removal of detectable harmful patterns in source code that are associated with defects in design and implementation of software. In particular, we investigate five code smells and four antipatterns previously defined in papers and books. Our inquiry is about detecting those in source code written in Python programming language, which is substantially different from all prior research, most of which concerns Java or C-like languages. Our approach was that of software engineers: we have processed existing research literature on the topic, extracted both the abstract definitions of nine design defects and their concrete implementation specifications, implemented them all in a tool we have programmed and let it loose on a huge test set obtained from open source code from thousands of GitHub projects. When it comes to knowledge, we have found that more than twice as many methods in Python can be considered too long (statistically extremely longer than their neighbours within the same project) than in Java, but long parameter lists are seven times less likely to be found in Python code than in Java code. We have also found that Functional Decomposition, the way it was defined for Java, is not found in the Python code at all, and Spaghetti Code and God Classes are extremely rare there as well. The grounding and the confidence in these results comes from the fact that we have performed our experiments on 32'058'823 lines of Python code, which is by far the largest test set for a freely available Python parser. We have also designed the experiment in such a way that it aligned with prior research on design defect detection in Java in order to ease the comparison if we treat our own actions as a replication. Thus, the importance of the work is both in the unique open Python grammar of highest quality, tested on millions of lines of code, and in the design defect detection tool which works on something else than Java.
AIJun 11, 2024
Mining Frequent Structures in Conceptual ModelsMattia Fumagalli, Tiago Prince Sales, Pedro Paulo F. Barcelos et al.
The problem of using structured methods to represent knowledge is well-known in conceptual modeling and has been studied for many years. It has been proven that adopting modeling patterns represents an effective structural method. Patterns are, indeed, generalizable recurrent structures that can be exploited as solutions to design problems. They aid in understanding and improving the process of creating models. The undeniable value of using patterns in conceptual modeling was demonstrated in several experimental studies. However, discovering patterns in conceptual models is widely recognized as a highly complex task and a systematic solution to pattern identification is currently lacking. In this paper, we propose a general approach to the problem of discovering frequent structures, as they occur in conceptual modeling languages. As proof of concept, we implement our approach by focusing on two widely-used conceptual modeling languages. This implementation includes an exploratory tool that integrates a frequent subgraph mining algorithm with graph manipulation techniques. The tool processes multiple conceptual models and identifies recurrent structures based on various criteria. We validate the tool using two state-of-the-art curated datasets: one consisting of models encoded in OntoUML and the other in ArchiMate. The primary objective of our approach is to provide a support tool for language engineers. This tool can be used to identify both effective and ineffective modeling practices, enabling the refinement and evolution of conceptual modeling languages. Furthermore, it facilitates the reuse of accumulated expertise, ultimately supporting the creation of higher-quality models in a given language.
SEMar 29, 2015
Guided Grammar ConvergenceVadim Zaytsev
Relating formal grammars is a hard problem that balances between language equivalence (which is known to be undecidable) and grammar identity (which is trivial). In this paper, we investigate several milestones between those two extremes and propose a methodology for inconsistency management in grammar engineering. While conventional grammar convergence is a practical approach relying on human experts to encode differences as transformation steps, guided grammar convergence is a more narrowly applicable technique that infers such transformation steps automatically by normalising the grammars and establishing a structural equivalence relation between them. This allows us to perform a case study with automatically inferring bidirectional transformations between 11 grammars (in a broad sense) of the same artificial functional language: parser specifications with different combinator libraries, definite clause grammars, concrete syntax definitions, algebraic data types, metamodels, XML schemata, object models.