Klaus Schmid

SE
h-index2
14papers
71citations
Novelty32%
AI Score43

14 Papers

SESep 22, 2023
Towards an MLOps Architecture for XAI in Industrial Applications

Leonhard Faubel, Thomas Woudsma, Leila Methnani et al.

Machine learning (ML) has become a popular tool in the industrial sector as it helps to improve operations, increase efficiency, and reduce costs. However, deploying and managing ML models in production environments can be complex. This is where Machine Learning Operations (MLOps) comes in. MLOps aims to streamline this deployment and management process. One of the remaining MLOps challenges is the need for explanations. These explanations are essential for understanding how ML models reason, which is key to trust and acceptance. Better identification of errors and improved model accuracy are only two resulting advantages. An often neglected fact is that deployed models are bypassed in practice when accuracy and especially explainability do not meet user expectations. We developed a novel MLOps software architecture to address the challenge of integrating explanations and feedback capabilities into the ML development and deployment processes. In the project EXPLAIN, our architecture is implemented in a series of industrial use cases. The proposed MLOps software architecture has several advantages. It provides an efficient way to manage ML models in production environments. Further, it allows for integrating explanations into the development and deployment processes.

SEApr 29Code
Reproducible Automated Program Repair Is Hard -- Experiences With the Defects4J Dataset

Adam Krafczyk, Klaus Schmid

In the research of automated program repair (APR), benchmark datasets consisting of known defects in combination with test suites that indicate the defects are of high importance. They allow for an evidence-based comparison of different APR approaches. In our own work on APR we found significant challenges when working with widely used defect datasets, which go beyond mere repeatability of defects via test cases. We summarize these identified challenges and related lessons learned to bring them to the attention of the APR community and quantify the potential impact of them. In particular, we investigate the widely used benchmark Defects4J, which has according to Google Scholar over 1,800 citations. It consists of 835 defects from 17 open-source Java projects; a hand-curated collection of defects, test suites that clearly indicate the defect, and human patches where any unrelated changes are removed. We find that, when executing the test suites with strict requirements for reproducibility in APR settings (beyond merely reproducing the defect via test cases), 180 (21.6 %) of the defects are not suitable for evaluation experiments. Further, we find that an additional 59 (7.1 %) defects have test suites that are obviously under-specified, as deleting a single statement from the code base makes all test cases pass, although the human-written patch does not only delete code. Our contributions are: a systematic collection of requirements for defect datasets for APR beyond traditional reproducibility of defects, a description of practical experiences and quantitative analysis of problems with the Defects4J dataset, as well as an implementation of an evaluation framework for APR tools for Java programs. This evaluation framework does stricter checking for indications of inadequate test suites, to avoid otherwise unnoticed problems in the test suite, such as flaky tests.

SEOct 19, 2021Code
MetricHaven -- More Than 23,000 Metrics for Measuring Quality Attributes of Software Product Lines

Sascha El-Sharkawy, Adam Krafczyk, Klaus Schmid

Variability-aware metrics are designed to measure qualitative aspects of software product lines. As we identified in a prior SLR \cite{El-SharkawyYamagishi-EichlerSchmid19}, there exist already many metrics that address code or variability separately, while the combination of both has been less researched. MetricHaven fills this gap, as it extensively supports combining information from code files and variability models. Further, we also enable the combination of well established single system metrics with novel variability-aware metrics, going beyond existing variability-aware metrics. Our tool supports most prominent single system and variability-aware code metrics. We provide configuration support for already implemented metrics, resulting in 23,342 metric variations. Further, we present an abstract syntax tree developed for MetricHaven, that allows the realization of additional code metrics. Tool: https://github.com/KernelHaven/MetricHaven Video: https://youtu.be/vPEmD5Sr6gM

IRJan 30
RAG-DIVE: A Dynamic Approach for Multi-Turn Dialogue Evaluation in Retrieval-Augmented Generation

Lorenz Brehme, Benedikt Dornauer, Jan-Henrik Böttcher et al.

Evaluating Retrieval-Augmented Generation (RAG) systems using static multi-turn datasets fails to capture the dynamic nature of real-world dialogues. Existing evaluation methods rely on predefined datasets, which restrict them to static, one-directional queries and limit their ability to capture the adaptive, context-dependent performance of RAG systems in interactive, multi-turn settings. Thus, we introduce the RAG-DIVE, a Dynamic Interactive Validation and Evaluation approach, that simulates user interactions with RAG systems. RAG-DIVE leverages an LLM to generate multi-turn conversations dynamically and is organized into three components. The dialogue generation stage consists of the (1) Conversation Generator, which simulates a user by creating multi-turn queries, and the (2) Conversation Validator, which filters and corrects invalid or low-quality outputs to ensure coherent conversations. The evaluation stage is handled by the (3) Conversation Evaluator, which assesses the RAG system's performance across the entire dialogue and generates both per-turn and multi-turn metrics that provide an aggregated view of system behavior. We validated RAG-DIVE through two experimental setups. First, we tested a sample RAG system, including human evaluation of dialogue quality, repeated trials to assess consistency, and an ablation study showing that RAG-DIVE detects performance changes caused by system modifications. Second, we compared RAG-DIVE with a traditional static dataset evaluation on an industrial RAG system under different configurations to verify whether both approaches reveal similar performance trends. Our findings demonstrate that RAG-DIVE facilitates dynamic, interaction-driven evaluation for multi-turn conversations, thereby advancing the assessment of RAG systems.

SENov 3, 2025
The Future of Generative AI in Software Engineering: A Vision from Industry and Academia in the European GENIUS Project

Robin Gröpler, Steffen Klepke, Jack Johns et al.

Generative AI (GenAI) has recently emerged as a groundbreaking force in Software Engineering, capable of generating code, identifying bugs, recommending fixes, and supporting quality assurance. While its use in coding tasks shows considerable promise, applying GenAI across the entire Software Development Life Cycle (SDLC) has not yet been fully explored. Critical uncertainties in areas such as reliability, accountability, security, and data privacy demand deeper investigation and coordinated action. The GENIUS project, comprising over 30 European industrial and academic partners, aims to address these challenges by advancing AI integration across all SDLC phases. It focuses on GenAI's potential, the development of innovative tools, and emerging research challenges, actively shaping the future of software engineering. This vision paper presents a shared perspective on the future of GenAI-driven software engineering, grounded in cross-sector dialogue as well as experiences and findings within the GENIUS consortium. The paper explores four central elements: (1) a structured overview of current challenges in GenAI adoption across the SDLC; (2) a forward-looking vision outlining key technological and methodological advances expected over the next five years; (3) anticipated shifts in the roles and required skill sets of software professionals; and (4) the contribution of GENIUS in realising this transformation through practical tools and industrial validation. This paper focuses on aligning technical innovation with business relevance. It aims to inform both research agendas and industrial strategies, providing a foundation for reliable, scalable, and industry-ready GenAI solutions for software engineering teams.

SEOct 25, 2021
Improving Software Engineering Research through Experimentation Workbenches

Klaus Schmid, Sascha El-Sharkawy, Christian Kröher

Experimentation with software prototypes plays a fundamental role in software engineering research. In contrast to many other scientific disciplines, however, explicit support for this key activity in software engineering is relatively small. While some approaches to improve this situation have been proposed by the software engineering community, experiments are still very difficult and sometimes impossible to replicate. In this paper, we propose the concept of an experimentation workbench as a means of explicit support for experimentation in software engineering research. In particular, we discuss core requirements that an experimentation workbench should satisfy in order to qualify as such and to offer a real benefit for researchers. Beyond their core benefits for experimentation, we stipulate that experimentation workbenches will also have benefits in regard to reproducibility and repeatability of software engineering research. Further, we illustrate this concept with a scenario and a case study, and describe relevant challenges as well as our experience with experimentation workbenches.

SEOct 19, 2021
KernelHaven -- An Open Infrastructure for Product Line Analysis

Christian Kröher, Sascha El-Sharkawy, Klaus Schmid

KernelHaven is an open infrastructure for Software Product Line (SPL) analysis. It is intended both as a production-quality analysis tool set as well as a research support tool, e.g., to support researchers in systematically exploring research hypothesis. For flexibility and ease of experimentation KernelHaven components are plug-ins for extracting certain information from SPL artifacts and processing this information, e.g., to check the correctness and consistency of variability information or to apply metrics. A configuration-based setup along with automatic documentation functionality allows different experiments and supports their easy reproduction. Here, we describe KernelHaven as a product line analysis research tool and highlight its basic approach as well as its fundamental capabilities. In particular, we describe available information extraction and processing plug-ins and how to combine them. On this basis, researchers and interested professional users can rapidly conduct a first set of experiments. Further, we describe the concepts for extending KernelHaven by new plug-ins, which reduces development effort when realizing new experiments.

SEOct 12, 2021
Fast Static Analyses of Software Product Lines -- An Example With More Than 42,000 Metrics

Sascha El-Sharkawy, Adam Krafczyk, Klaus Schmid

Context: Software metrics, as one form of static analyses, is a commonly used approach in software engineering in order to understand the state of a software system, in particular to identify potential areas prone to defects. Family-based techniques extract variability information from code artifacts in Software Product Lines (SPLs) to perform static analysis for all available variants. Many different types of metrics with numerous variants have been defined in literature. When counting all metrics including such variants, easily thousands of metrics can be defined. Computing all of them for large product lines can be an extremely expensive process in terms of performance and resource consumption. Objective: We address these performance and resource challenges while supporting customizable metric suites, which allow running both, single system and variability-aware code metrics. Method: In this paper, we introduce a partial parsing approach used for the efficient measurement of more than 42,000 code metric variations. The approach covers variability information and restricts parsing to the relevant parts of the Abstract Syntax Tree (AST). Conclusions: This partial parsing approach is designed to cover all relevant information to compute a broad variety of variability-aware code metrics on code artifacts containing annotation-based variability, e.g., realized with C-preprocessor statements. It allows for the flexible combination of single system and variability-aware metrics, which is not supported by existing tools. This is achieved by a novel representation of partially parsed product line code artifacts, which is tailored to the computation of the metrics. Our approach consumes considerably less resources, especially when computing many metric variants in parallel.

SEOct 12, 2021
Reverse Engineering Code Dependencies: Converting Integer-Based Variability to Propositional Logic

Adam Krafczyk, Sascha El-Sharkawy, Klaus Schmid

A number of SAT-based analysis concepts and tools for software product lines exist, that extract code dependencies in propositional logic from the source code assets of the product line. On these extracted conditions, SAT-solvers are used to reason about the variability. However, in practice, a lot of software product lines use integer-based variability. The variability variables hold integer values, and integer operators are used in the conditions. Most existing analysis tools can not handle this kind of variability; they expect pure Boolean conditions. This paper introduces an approach to convert integer-based variability conditions to propositional logic. Running this approach as a preparation on an integer-based product line allows the existing SAT-based analyses to work without any modifications. The pure Boolean formulas, that our approach builds as a replacement for the integer-based conditions, are mostly equivalent to the original conditions with respect to satisfiability. Our approach was motivated by and implemented in the context of a real-world industrial case-study, where such a preparation was necessary to analyze the variability. Our contribution is an approach to convert conditions, that use integer variables, into propositional formulas, to enable easy usage of SAT-solvers on the result. It works well on restricted variables (i.e. variables with a small range of allowed values); unrestricted integer variables are handled less exact, but still retain useful variability information.

SEOct 12, 2021
Reverse Engineering Variability in an Industrial Product Line: Observations and Lessons Learned

Sascha El-Sharkawy, Dhar Saura Jyoti, Adam Krafczyk et al.

Ideally, a variability model is a correct and complete representation of product line features and constraints among them. Together with a mapping between features and code, this ensures that only valid products can be configured and derived. However, in practice the modeled constraints might be neither complete nor correct, which causes problems in the configuration and product derivation phases. This paper presents an approach to reverse engineer variability constraints from the implementation, and thus improve the correctness and completeness of variability models. We extended the concept of feature effect analysis to extract variability constraints from code artifacts of the Bosch PS-EC large-scale product line. We present an industrial application of the approach and discuss its required modifications to handle non-Boolean variability and heterogeneous artifact types.

SEOct 12, 2021
KernelHaven -- An Experimentation Workbench for Analyzing Software Product Lines

Christian Kröher, Sascha El-Sharkawy, Klaus Schmid

Systematic exploration of hypotheses is a major part of any empirical research. In software engineering, we often produce unique tools for experiments and evaluate them independently on different data sets. In this paper, we present KernelHaven as an experimentation workbench supporting a significant number of experiments in the domain of static product line analysis and verification. It addresses the need for extracting information from a variety of artifacts in this domain by means of an open plug-in infrastructure. Available plug-ins encapsulate existing tools, which can now be combined efficiently to yield new analyses. As an experimentation workbench, it provides configuration-based definitions of experiments, their documentation, and technical services, like parallelization and caching. Hence, researchers can abstract from technical details and focus on the algorithmic core of their research problem. KernelHaven supports different types of analyses, like correctness checks, metrics, etc., in its specific domain. The concepts presented in this paper can also be transferred to support researchers of other software engineering domains. The infrastructure is available under Apache 2.0: https://github.com/KernelHaven. The plug-ins are available under their individual licenses.

SEOct 12, 2021
An Empirical Study of Configuration Mismatches in Linux

Sascha El-Sharkawy, Adam Krafczyk, Klaus Schmid

Ideally the variability of a product line is represented completely and correctly by its variability model. However, in practice additional variability is often represented on the level of the build system or in the code. Such a situation may lead to inconsistencies, where the actually realized variability does not fully correspond to the one described by the variability model. In this paper we focus on configuration mismatches, i.e., cases where the effective variability differs from the variability as it is represented by the variability model. While previous research has already shown that these situations still exist even today in well-analyzed product lines like Linux, so far it was unclear under what circumstances such issues occur in reality. In particular, it is open what types of configuration mismatches occur and how severe they are. Here, our contribution is to close this gap by presenting a detailed manual analysis of 80 configuration mismatches in the Linux 4.4.1 kernel and assess their criticality. We identify various categories of configuration issues and show that about two-thirds of the configuration mismatches may actually lead to kernel misconfigurations.

SEJun 17, 2021
Elicitation of Adaptive Requirements Using Creativity Triggers: A Controlled Experiment

Fabian Kneer, Erik Kamsties, Klaus Schmid

Adaptive systems react to changes in their environment by changing their behavior. Identifying these needed adaptations is very difficult, but central to requirements elicitation for adaptive systems. As the necessary or potential adaptations are typically not obvious to the stakeholders, the problem is how to effectively elicit adaptation-relevant information. One approach is to use creativity techniques to support the systematic identification and elicitation of adaptation requirements. In particular, here, we analyze a set of creativity triggers defined for systematic exploration of potential adaptation requirements. We compare these triggers with brainstorming as a baseline in a controlled experiment with 85 master students. The results indicate that the proposed triggers are suitable for the efficient elicitation of adaptive requirements and that the 15 trigger questions produce significantly more requirements fragments than solo brainstorming.

SENov 16, 2020
Environment Modeling for Adaptive Systems: A Systematic Literature Review

Fabian Kneer, Erik Kamsties, Klaus Schmid

[Context & Motivation] Adaptive systems are an important research area. The dominant reason for adaptivity in systems are changes in the environment. Thus, it is an important question how to model the environment and how to determine the necessary information on this environment in the requirements engineering phase. [Question/ Problem] There is so far relatively little explicit study of the notion of environment models in software engineering research. [Principal ideas/ Results] In this paper, we present a systematic literature review with the goal to determine the state of the art in environment modeling for adaptive systems, in particular from a requirements perspective. We discuss the goals of the approaches, the modeling concepts, as well as the methodology aspects of environment modeling in our survey. [Contribution] As major result of our survey, we provide a meta-model of existing environment modeling concepts. As a negative finding - and a research opportunity - we find that so far methodological aspects of environment modeling have received very little attention.