Stefan Wagner

h-index47

118papers

3,871citations

Novelty25%

AI Score53

Ranked #32,112 of 201,326 authors (top 16%)#292 in SE (top 9%)

118 Papers

SEJun 1Code

Comparing ML-Specific and General Python Code Smells Across Project Characteristics

Halimeh Agh, Betül Cimendag, Stefan Wagner

Machine learning systems consist of general-purpose code as well as machine-learning-specific code. While ML-specific code smells have been identified, their connection to project characteristics and their interaction with overall code quality are not well understood. Without this knowledge, quality assurance strategies remain one-size-fits-all, failing to account for the contextual factors that drive technical debt in ML systems. We present empirical evidence by examining how six project features (size, age, contributors, commit frequency, CI/CD adoption, and domain) relate to both ML-specific and general Python code quality in 279 open-source ML projects on GitHub. Using CodeSmile for ML code smells and Pylint for general Python smells, our results show: (1) ML code smells are 41-94 times less frequent than general Python smells; (2) commit frequency and domain are significantly associated with ML-specific quality, while project size, team size, age, and CI/CD adoption are not, challenging traditional views on technical debt; (3) general Python smells are not linked to any project characteristic, indicating systemic coding issues that are independent of project context; (4) domains that suffer most from ML-specific smells are not necessarily the same domains that suffer most from general Python smells, necessitating tailored quality strategies for each smell type. MLOps often involves configuration issues, Reinforcement Learning faces challenges with tensor manipulation, and Computer Vision encounters problems with GPU workflows. Overall, ML code quality depends on domain-specific practices and specialized CI/CD quality gates, as standard automation often overlooks domain-specific correctness problems.

SEMay 10

Guidelines for Empirical Studies in Software Engineering involving Large Language Models

Sebastian Baltes, Florian Angermeir, Chetan Arora et al.

Large Language Models (LLMs) are widely used in software engineering (SE) research and practice, yet their non-determinism, opaque training data, and rapidly evolving models threaten the reproducibility and replicability of empirical studies. We address this challenge through a collaborative effort of 22 researchers, presenting a taxonomy of seven study types that organizes how LLMs are used in SE research, together with eight guidelines for designing and reporting such studies. Each guideline distinguishes requirements (must) from recommended practices (should) and is contextualized by the study types it applies to. Our guidelines recommend that researchers: (1) declare LLM usage and role; (2) report model versions, configurations, and customizations; (3) document the tool architecture beyond the model; (4) disclose prompts, their development, and interaction logs; (5) validate LLM outputs with humans; (6) include an open LLM as a baseline; (7) use suitable baselines, benchmarks, and metrics; and (8) articulate limitations and mitigations. We complement the guidelines with an applicability matrix mapping guidelines to study types and a reporting checklist for authors and reviewers. We maintain the study types and guidelines online as a living resource for the community to use and shape (llm-guidelines$.$org).

SEJun 3, 2022

Can Requirements Engineering Support Explainable Artificial Intelligence? Towards a User-Centric Approach for Explainability Requirements

Umm-e-Habiba, Justus Bogner, Stefan Wagner

With the recent proliferation of artificial intelligence systems, there has been a surge in the demand for explainability of these systems. Explanations help to reduce system opacity, support transparency, and increase stakeholder trust. In this position paper, we discuss synergies between requirements engineering (RE) and Explainable AI (XAI). We highlight challenges in the field of XAI, and propose a framework and research directions on how RE practices can help to mitigate these challenges.

SEJul 4, 2022

The Present and Future of Bots in Software Engineering

Emad Shihab, Stefan Wagner, Marco A. Gerosa et al.

We are witnessing a massive adoption of software engineering bots, applications that react to events triggered by tools and messages posted by users and run automated tasks in response, in a variety of domains. This thematic issues describes experiences and challenges with these bots.

SESep 11, 2024

How Mature is Requirements Engineering for AI-based Systems? A Systematic Mapping Study on Practices, Challenges, and Future Research Directions

Umm-e- Habiba, Markus Haug, Justus Bogner et al.

Artificial intelligence (AI) permeates all fields of life, which resulted in new challenges in requirements engineering for artificial intelligence (RE4AI), e.g., the difficulty in specifying and validating requirements for AI or considering new quality requirements due to emerging ethical implications. It is currently unclear if existing RE methods are sufficient or if new ones are needed to address these challenges. Therefore, our goal is to provide a comprehensive overview of RE4AI to researchers and practitioners. What has been achieved so far, i.e., what practices are available, and what research gaps and challenges still need to be addressed? To achieve this, we conducted a systematic mapping study combining query string search and extensive snowballing. The extracted data was aggregated, and results were synthesized using thematic analysis. Our selection process led to the inclusion of 126 primary studies. Existing RE4AI research focuses mainly on requirements analysis and elicitation, with most practices applied in these areas. Furthermore, we identified requirements specification, explainability, and the gap between machine learning engineers and end-users as the most prevalent challenges, along with a few others. Additionally, we proposed seven potential research directions to address these challenges. Practitioners can use our results to identify and select suitable RE methods for working on their AI-based systems, while researchers can build on the identified gaps and research directions to push the field forward.

AIMay 26, 2022

Learning in Feedback-driven Recurrent Spiking Neural Networks using full-FORCE Training

Ankita Paul, Stefan Wagner, Anup Das

Feedback-driven recurrent spiking neural networks (RSNNs) are powerful computational models that can mimic dynamical systems. However, the presence of a feedback loop from the readout to the recurrent layer de-stabilizes the learning mechanism and prevents it from converging. Here, we propose a supervised training procedure for RSNNs, where a second network is introduced only during the training, to provide hint for the target dynamics. The proposed training procedure consists of generating targets for both recurrent and readout layers (i.e., for a full RSNN system). It uses the recursive least square-based First-Order and Reduced Control Error (FORCE) algorithm to fit the activity of each layer to its target. The proposed full-FORCE training procedure reduces the amount of modifications needed to keep the error between the output and target close to zero. These modifications control the feedback loop, which causes the training to converge. We demonstrate the improved performance and noise robustness of the proposed full-FORCE training procedure to model 8 dynamical systems using RSNNs with leaky integrate and fire (LIF) neurons and rate coding. For energy-efficient hardware implementation, an alternative time-to-first-spike (TTFS) coding is implemented for the full- FORCE training procedure. Compared to rate coding, full-FORCE with TTFS coding generates fewer spikes and facilitates faster convergence to the target dynamics.

HCDec 18, 2025

Virtual Reality User Interface Design: Best Practices and Implementation

Esin Mehmedova, Santiago Berrezueta-Guzman, Stefan Wagner

Designing effective user interfaces (UIs) for virtual reality (VR) is essential to enhance user immersion, usability, comfort, and accessibility in virtual environments. Despite the growing adoption of VR across domains, there is a noticeable lack of unified and comprehensive design guidelines for VR UI design. To address this gap, we conducted a systematic literature review to identify existing best practices and propose 28 unified guidelines for UI development in VR. Building on these insights, this research proposes a framework to guide the creation of more effective VR interfaces. To demonstrate and validate these practices, we developed a VR application called FlUId and an interactive Web Tool that serves as a guideline explorer and project planning resource for developers. A user study was conducted to evaluate the impact of the proposed guidelines. The findings aim to bridge the gap between theory and practice, offering concrete recommendations and digital tools for VR designers and developers.

AIJul 27, 2024

Interactive Learning in Computer Science Education Supported by a Discord Chatbot

Santiago Berrezueta-Guzman, Ivan Parmacli, Stephan Krusche et al.

Enhancing interaction and feedback collection in a first-semester computer science course poses a significant challenge due to students' diverse needs and engagement levels. To address this issue, we created and integrated a command-based chatbot on the course communication server on Discord. The DiscordBot enables students to provide feedback on course activities through short surveys, such as exercises, quizzes, and lectures, facilitating stress-free communication with instructors. It also supports attendance tracking and introduces lectures before they start. The research demonstrates the effectiveness of the DiscordBot as a communication tool. The ongoing feedback allowed course instructors to dynamically adjust and improve the difficulty level of upcoming activities and promote discussion in subsequent tutor sessions. The data collected reveal that students can accurately perceive the activities' difficulty and expected results, providing insights not possible through traditional end-of-semester surveys. Students reported that interaction with the DiscordBot was easy and expressed a desire to continue using it in future semesters. This responsive approach ensures the course meets the evolving needs of students, thereby enhancing their overall learning experience.

HCAug 1, 2025Code

How LLMs are Shaping the Future of Virtual Reality

Süeda Özkaya, Santiago Berrezueta-Guzman, Stefan Wagner

The integration of Large Language Models (LLMs) into Virtual Reality (VR) games marks a paradigm shift in the design of immersive, adaptive, and intelligent digital experiences. This paper presents a comprehensive review of recent research at the intersection of LLMs and VR, examining how these models are transforming narrative generation, non-player character (NPC) interactions, accessibility, personalization, and game mastering. Drawing from an analysis of 62 peer reviewed studies published between 2018 and 2025, we identify key application domains ranging from emotionally intelligent NPCs and procedurally generated storytelling to AI-driven adaptive systems and inclusive gameplay interfaces. We also address the major challenges facing this convergence, including real-time performance constraints, memory limitations, ethical risks, and scalability barriers. Our findings highlight that while LLMs significantly enhance realism, creativity, and user engagement in VR environments, their effective deployment requires robust design strategies that integrate multimodal interaction, hybrid AI architectures, and ethical safeguards. The paper concludes by outlining future research directions in multimodal AI, affective computing, reinforcement learning, and open-source development, aiming to guide the responsible advancement of intelligent and inclusive VR systems.

CYMar 19

Beyond the Code: A Multi-Modal Assessment Strategy for Fostering Professional Competencies via Introductory Programming Projects

Santiago Berrezueta-Guzman, Vanesa Metaj, Stefan Wagner

As the landscape of software engineering evolves, introductory programming courses must go beyond teaching syntax to foster comprehensive technical competencies and professional soft skills. This paper reports on a pedagogical experience in a "Fundamentals of Programming" course that used a Project-Based Learning (PBL) framework to develop a 2D "Maze Runner"-style game. While game development serves as a high-engagement vehicle for mastering core concepts, such as multidimensional arrays, control structures, and logic, the core of this study focuses on implementing a rigorous, multifaceted assessment model structured across four distinct dimensions: (1) an in-situ technical demonstration, evaluating real-time code execution and algorithmic robustness; (2) a technical screencast, requiring students to articulate their work in a concise audiovisual format; (3) a formal presentation to instructors, defending their project's design patterns and problem-solving strategies; and (4) a structured peer-review process, where students evaluated their colleagues' projects. Our findings suggest that this multi-dimensional approach not only improves student retention of programming fundamentals but also significantly enhances communication skills and critical thinking. By integrating peer evaluation and multimedia documentation, the course successfully bridges the gap between basic coding and the collaborative requirements of modern software engineering. This paper details the curriculum design, the challenges of implementing diverse assessment pillars, and the measurable impact on student performance and engagement, providing a scalable roadmap for educators looking to modernize introductory computing curricula.

SEApr 13, 2019Code

Open Science in Software Engineering

Daniel Méndez Fernández, Daniel Graziotin, Stefan Wagner et al.

Open science describes the movement of making any research artefact available to the public and includes, but is not limited to, open access, open data, and open source. While open science is becoming generally accepted as a norm in other scientific disciplines, in software engineering, we are still struggling in adapting open science to the particularities of our discipline, rendering progress in our scientific community cumbersome. In this chapter, we reflect upon the essentials in open science for software engineering including what open science is, why we should engage in it, and how we should do it. We particularly draw from our experiences made as conference chairs implementing open science initiatives and as researchers actively engaging in open science to critically discuss challenges and pitfalls, and to address more advanced topics such as how and under which conditions to share preprints, what infrastructure and licence model to cover, or how do it within the limitations of different reviewing models, such as double-blind reviewing. Our hope is to help establishing a common ground and to contribute to make open science a norm also in software engineering.

SEMar 13, 2019Code

Is the Stack Distance Between Test Case and Method Correlated With Test Effectiveness?

Rainer Niedermayr, Stefan Wagner

Mutation testing is a means to assess the effectiveness of a test suite and its outcome is considered more meaningful than code coverage metrics. However, despite several optimizations, mutation testing requires a significant computational effort and has not been widely adopted in industry. Therefore, we study in this paper whether test effectiveness can be approximated using a more light-weight approach. We hypothesize that a test case is more likely to detect faults in methods that are close to the test case on the call stack than in methods that the test case accesses indirectly through many other methods. Based on this hypothesis, we propose the minimal stack distance between test case and method as a new test measure, which expresses how close any test case comes to a given method, and study its correlation with test effectiveness. We conducted an empirical study with 21 open-source projects, which comprise in total 1.8 million LOC, and show that a correlation exists between stack distance and test effectiveness. The correlation reaches a strength up to 0.58. We further show that a classifier using the minimal stack distance along with additional easily computable measures can predict the mutation testing result of a method with 92.9% precision and 93.4% recall. Hence, such a classifier can be taken into consideration as a light-weight alternative to mutation testing or as a preceding, less costly step to that.

SENov 2, 2018Code

Too Trivial To Test? An Inverse View on Defect Prediction to Identify Methods with Low Fault Risk

Rainer Niedermayr, Tobias Röhm, Stefan Wagner

Background. Test resources are usually limited and therefore it is often not possible to completely test an application before a release. To cope with the problem of scarce resources, development teams can apply defect prediction to identify fault-prone code regions. However, defect prediction tends to low precision in cross-project prediction scenarios. Aims. We take an inverse view on defect prediction and aim to identify methods that can be deferred when testing because they contain hardly any faults due to their code being "trivial". We expect that characteristics of such methods might be project-independent, so that our approach could improve cross-project predictions. Method. We compute code metrics and apply association rule mining to create rules for identifying methods with low fault risk. We conduct an empirical study to assess our approach with six Java open-source projects containing precise fault data at the method level. Results. Our results show that inverse defect prediction can identify approx. 32-44% of the methods of a project to have a low fault risk; on average, they are about six times less likely to contain a fault than other methods. In cross-project predictions with larger, more diversified training sets, identified methods are even eleven times less likely to contain a fault. Conclusions. Inverse defect prediction supports the efficient allocation of test resources by identifying methods that can be treated with less priority in testing activities and is well applicable in cross-project prediction scenarios.

CRJul 3, 2018Code

Usability and Security Effects of Code Examples on Crypto APIs - CryptoExamples: A platform for free, minimal, complete and secure crypto examples

Kai Mindermann, Stefan Wagner

Context: Cryptographic APIs are said to be not usable and researchers suggest to add example code to the documentation. Aim: We wanted to create a free platform for cryptographic code examples that improves the usability and security of created applications by non security experts. Method: We created the open-source web platform CryptoExamples and conducted a controlled experiment where 58 students added symmetric encryption to a Java program. We then measured the usability and security. Results: The participants who used the platform were not only significantly more effective (+73 %) but also their code contained significantly less possible security vulnerabilities (-66 %). Conclusions: With CryptoExamples the gap between hard to change API documentation and the need for complete and secure code examples can be closed. Still, the platform needs more code examples.

SEJun 12, 2018Code

Evaluating Maintainability Prejudices with a Large-Scale Study of Open-Source Projects

Tobias Roehm, Daniel Veihelmann, Stefan Wagner et al.

Exaggeration or context changes can render maintainability experience into prejudice. For example, JavaScript is often seen as least elegant language and hence of lowest maintainability. Such prejudice should not guide decisions without prior empirical validation. We formulated 10 hypotheses about maintainability based on prejudices and test them in a large set of open-source projects (6,897 GitHub repositories, 402 million lines, 5 programming languages). We operationalize maintainability with five static analysis metrics. We found that JavaScript code is not worse than other code, Java code shows higher maintainability than C# code and C code has longer methods than other code. The quality of interface documentation is better in Java code than in other code. Code developed by teams is not of higher and large code bases not of lower maintainability. Projects with high maintainability are not more popular or more often forked. Overall, most hypotheses are not supported by open-source data.

SEMay 3, 2018Code

Poster: Identification of Methods with Low Fault Risk

Rainer Niedermayr, Tobias Röhm, Stefan Wagner

Test resources are usually limited and therefore it is often not possible to completely test an application before a release. Therefore, testers need to focus their activities on the relevant code regions. In this paper, we introduce an inverse defect prediction approach to identify methods that contain hardly any faults. We applied our approach to six Java open-source projects and show that on average 31.6% of the methods of a project have a low fault risk; they contain in total, on average, only 5.8% of all faults. Furthermore, the results suggest that, unlike defect prediction, our approach can also be applied in cross-project prediction scenarios. Therefore, inverse defect prediction can help prioritize untested code areas and guide testers to increase the fault detection probability.

SEMar 26, 2018Code

Poster: Communication in Open-Source Projects--End of the E-mail Era?

Verena Käfer, Daniel Graziotin, Ivan Bogicevic et al.

Communication is essential in software engineering. Especially in distributed open-source teams, communication needs to be supported by channels including mailing lists, forums, issue trackers, and chat systems. Yet, we do not have a clear understanding of which communication channels stakeholders in open-source projects use. In this study, we fill the knowledge gap by investigating a statistically representative sample of 400 GitHub projects. We discover the used communication channels by regular expressions on project data. We show that (1) half of the GitHub projects use observable communication channels; (2) GitHub Issues, e-mail addresses, and the modern chat system Gitter are the most common channels; (3) mailing lists are only in place five and have a lower market share than all modern chat systems combined.

SEMar 13, 2017Code

Are Comprehensive Quality Models Necessary for Evaluating Software Quality?

Klaus Lochmann, Jasmin Ramadani, Stefan Wagner

The concept of software quality is very complex and has many facets. Reflecting all these facets and at the same time measuring everything related to these facets results in comprehensive but large quality models and extensive measurements. In contrast, there are also many smaller, focused quality models claiming to evaluate quality with few measures. We investigate if and to what extent it is possible to build a focused quality model with similar evaluation results as a comprehensive quality model but with far less measures needed to be collected and, hence, reduced effort. We make quality evaluations with the comprehensive Quamoco base quality model and build focused quality models based on the same set of measures and data from over 2,000 open source systems. We analyse the ability of the focused model to predict the results of the Quamoco model by comparing them with a random predictor as a baseline. We calculate the standardised accuracy measure SA and effect sizes. We found that for the Quamoco model and its 378 automatically collected measures, we can build a focused model with only 10 measures but an accuracy of 61% and a medium to high effect size. We conclude that we can build focused quality models to get an impression of a system's quality similar to comprehensive models. However, when including manually collected measures, the accuracy of the models stayed below 50%. Hence, manual measures seem to have a high impact and should therefore not be ignored in a focused model.

SEJan 19, 2017Code

Do Code Clones Matter?

Elmar Juergens, Florian Deissenboeck, Benjamin Hummel et al.

Code cloning is not only assumed to inflate maintenance costs but also considered defect-prone as inconsistent changes to code duplicates can lead to unexpected behavior. Consequently, the identification of duplicated code, clone detection, has been a very active area of research in recent years. Up to now, however, no substantial investigation of the consequences of code cloning on program correctness has been carried out. To remedy this shortcoming, this paper presents the results of a large-scale case study that was undertaken to find out if inconsistent changes to cloned code can indicate faults. For the analyzed commercial and open source systems we not only found that inconsistent changes to clones are very frequent but also identified a significant number of faults induced by such changes. The clone detection tool used in the case study implements a novel algorithm for the detection of inconsistent clones. It is available as open source to enable other researchers to use it as basis for further investigations.

SEDec 9, 2016Code

A Systematic and Semi-Automatic Safety-Based Test Case Generation Approach Based on Systems-Theoretic Process Analysis

Asim Abdulkhaleq, Stefan Wagner

Software safety is a crucial aspect during the development of modern safety-critical systems. Software is becoming responsible for most of the critical functions of systems. Therefore, the software components in the systems need to be tested extensively against their safety requirements to ensure a high level of system safety. However, performing testing exhaustively to test all software behaviours is impossible. Numerous testing approaches exist. However, they do not directly concern the information derived during the safety analysis. STPA (Systems-Theoretic Process Analysis) is a unique safety analysis approach based on system and control theory, and was developed to identify unsafe scenarios of a complex system including software. In this paper, we present a systematic and semi-automatic testing approach based on STPA to generate test cases from the STPA safety analysis results to help software and safety engineers to recognize and reduce the associated software risks. We also provide an open-source safety-based testing tool called STPA TCGenerator to support the proposed approach. We illustrate the proposed approach with a prototype of a software of the Adaptive Cruise Control System (ACC) with a stop-and-go function with a Lego-Mindstorms EV3 robot.

SENov 30, 2016Code

A Bayesian Network Approach to Assess and Predict Software Quality Using Activity-Based Quality Models

Stefan Wagner

Context: Software quality is a complex concept. Therefore, assessing and predicting it is still challenging in practice as well as in research. Activity-based quality models break down this complex concept into concrete definitions, more precisely facts about the system, process, and environment as well as their impact on activities performed on and with the system. However, these models lack an operationalisation that would allow them to be used in assessment and prediction of quality. Bayesian networks have been shown to be a viable means for this task incorporating variables with uncertainty. Objective: The qualitative knowledge contained in activity-based quality models are an abundant basis for building Bayesian networks for quality assessment. This paper describes a four-step approach for deriving systematically a Bayesian network from an assessment goal and a quality model. Method: The four steps of the approach are explained in detail and with running examples. Furthermore, an initial evaluation is performed, in which data from NASA projects and an open source system is obtained. The approach is applied to this data and its applicability is analysed. Results: The approach is applicable to the data from the NASA projects and the open source system. However, the predictive results vary depending on the availability and quality of the data, especially the underlying general distributions. Conclusion: The approach is viable in a realistic context but needs further investigation in case studies in order to analyse its predictive validity.

SENov 28, 2016Code

Operationalised product quality models and assessment: The Quamoco approach

Stefan Wagner, Andreas Goeb, Lars Heinemann et al.

Software quality models provide either abstract quality characteristics or concrete quality measurements; there is no seamless integration of these two aspects. Reasons for this include the complexity of quality and the various quality profiles in different domains which make it difficult to build operationalised quality models. In the project Quamoco, we developed a comprehensive approach for closing this gap. It combined constructive research, which involved quality experts from academia and industry in workshops, sprint work and reviews, with empirical studies. All deliverables within the project were peer-reviewed by two project members from a different area. Most deliverables were developed in two or three iterations and underwent an evaluation. We contribute a comprehensive quality modelling and assessment approach: (1) A meta quality model defines the structure of operationalised quality models. It includes the concept of a product factor, which bridges the gap between concrete measurements and abstract quality aspects, and allows modularisation to create modules for specific domains. (2) A largely technology-independent base quality model reduces the effort and complexity of building quality models for specific domains. For Java and C# systems, we refined it with about 300 concrete product factors and 500 measures. (3) A concrete and comprehensive quality assessment approach makes use of the concepts in the meta-model. (4) An empirical evaluation of the above results using real-world software systems. (5) The extensive, open-source tool support is in a mature state. (6) The model for embedded software systems is a proof-of-concept for domain-specific quality models. We provide a broad basis for the development and application of quality models in industrial practice as well as a basis for further extension, validation and comparison with other approaches in research.

SENov 28, 2016Code

Tool Support for Continuous Quality Control

Florian Deissenboeck, Stefan Wagner, Markus Pizka et al.

Over time, software systems suffer gradual quality decay and therefore costs can rise if organizations fail to take proactive countermeasures. Quality control is the first step to avoiding this cost trap. Continuous quality assessments help users identify quality problems early, when their removal is still inexpensive; they also aid decision making by providing an integrated view of a software system's current status. As a side effect, continuous and timely feedback helps developers and maintenance personnel improve their skills and thereby decreases the likelihood of future quality defects. To make regular quality control feasible, it must be highly automated, and assessment results must be presented in an aggregated manner to avoid overwhelming users with data. This article offers an overview of tools that aim to address these issues. The authors also discuss their own flexible, open-source toolkit, which supports the creation of dashboards for quality control.

SENov 22, 2016Code

Will My Tests Tell Me If I Break This Code?

Rainer Niedermayr, Elmar Juergens, Stefan Wagner

Automated tests play an important role in software evolution because they can rapidly detect faults introduced during changes. In practice, code-coverage metrics are often used as criteria to evaluate the effectiveness of test suites with focus on regression faults. However, code coverage only expresses which portion of a system has been executed by tests, but not how effective the tests actually are in detecting regression faults. Our goal was to evaluate the validity of code coverage as a measure for test effectiveness. To do so, we conducted an empirical study in which we applied an extreme mutation testing approach to analyze the tests of open-source projects written in Java. We assessed the ratio of pseudo-tested methods (those tested in a way such that faults would not be detected) to all covered methods and judged their impact on the software project. The results show that the ratio of pseudo-tested methods is acceptable for unit tests but not for system tests (that execute large portions of the whole system). Therefore, we conclude that the coverage metric is only a valid effectiveness indicator for unit tests.

SENov 21, 2016Code

The Use of Application Scanners in Software Product Quality Assessment

Stefan Wagner

Software development needs continuous quality control for a timely detection and removal of quality problems. This includes frequent quality assessments, which need to be automated as far as possible to be feasible. One way of automation in assessing the security of software are application scanners that test an executing software for vulnerabilities. At present, common quality assessments do not integrate such scanners for giving an overall quality statement. This paper presents an integration of application scanners into a general quality assessment method based on explicit quality models and Bayesian nets. Its applicability and the detection capabilities of common scanners are investigated in a case study with two open-source web shops.

SEJan 23

Adoption of Generative Artificial Intelligence in the German Software Engineering Industry: An Empirical Study

Ludwig Felder, Tobias Eisenreich, Mahsa Fischer et al.

Generative artificial intelligence (GenAI) tools have seen rapid adoption among software developers. While adoption rates in the industry are rising, the underlying factors influencing the effective use of these tools, including the depth of interaction, organizational constraints, and experience-related considerations, have not been thoroughly investigated. This issue is particularly relevant in environments with stringent regulatory requirements, such as Germany, where practitioners must address the GDPR and the EU AI Act while balancing productivity gains with intellectual property considerations. Despite the significant impact of GenAI on software engineering, to the best of our knowledge, no empirical study has systematically examined the adoption dynamics of GenAI tools within the German context. To address this gap, we present a comprehensive mixed-methods study on GenAI adoption among German software engineers. Specifically, we conducted 18 exploratory interviews with practitioners, followed by a developer survey with 109 participants. We analyze patterns of tool adoption, prompting strategies, and organizational factors that influence effectiveness. Our results indicate that experience level moderates the perceived benefits of GenAI tools, and productivity gains are not evenly distributed among developers. Further, organizational size affects both tool selection and the intensity of tool use. Limited awareness of the project context is identified as the most significant barrier. We summarize a set of actionable implications for developers, organizations, and tool vendors seeking to advance artificial intelligence (AI) assisted software development.

SEMar 27

Automating Domain-Driven Design: Experience with a Prompting Framework

Tobias Eisenreich, Husein Jusic, Stefan Wagner

Domain-driven design (DDD) is a powerful design technique for architecting complex software systems. This paper introduces a prompting framework that automates core DDD activities through structured large language model (LLM) interactions. We decompose DDD into five sequential steps: (1) establishing an ubiquitous language, (2) simulating event storming, (3) identifying bounded contexts, (4) designing aggregates, and (5) mapping to technical architecture. In a case study, we validated the prompting framework against real-world requirements from FTAPI's enterprise platform. While the first steps consistently generate valuable and usable artifacts, later steps show how minor errors or inaccuracies can propagate and accumulate. Overall, the framework excels as a collaborative sparring partner for building actionable documentation, such as glossaries and context maps, but not for full automation. This allows the experts to concentrate their discussion on the critical trade-offs. In our evaluation, Steps 1 to 3 worked well, but the accumulated errors rendered the artifacts generated from Steps 4 and 5 impractical. Our findings show that LLMs can enhance, but not replace, architectural expertise, offering a practical tool to reduce the effort and overhead of DDD while preserving human-centric decision-making.

SEMay 20, 2024

Naming the Pain in Machine Learning-Enabled Systems Engineering

Marcos Kalinowski, Daniel Mendez, Görkem Giray et al.

Context: Machine learning (ML)-enabled systems are being increasingly adopted by companies aiming to enhance their products and operational processes. Objective: This paper aims to deliver a comprehensive overview of the current status quo of engineering ML-enabled systems and lay the foundation to steer practically relevant and problem-driven academic research. Method: We conducted an international survey to collect insights from practitioners on the current practices and problems in engineering ML-enabled systems. We received 188 complete responses from 25 countries. We conducted quantitative statistical analyses on contemporary practices using bootstrapping with confidence intervals and qualitative analyses on the reported problems using open and axial coding procedures. Results: Our survey results reinforce and extend existing empirical evidence on engineering ML-enabled systems, providing additional insights into typical ML-enabled systems project contexts, the perceived relevance and complexity of ML life cycle phases, and current practices related to problem understanding, model deployment, and model monitoring. Furthermore, the qualitative analysis provides a detailed map of the problems practitioners face within each ML life cycle phase and the problems causing overall project failure. Conclusions: The results contribute to a better understanding of the status quo and problems in practical environments. We advocate for the further adaptation and dissemination of software engineering practices to enhance the engineering of ML-enabled systems.

SENov 12, 2025

Leveraging Large Language Models for Use Case Model Generation from Software Requirements

Tobias Eisenreich, Nicholas Friedlaender, Stefan Wagner

Use case modeling employs user-centered scenarios to outline system requirements. These help to achieve consensus among relevant stakeholders. Because the manual creation of use case models is demanding and time-consuming, it is often skipped in practice. This study explores the potential of Large Language Models (LLMs) to assist in this tedious process. The proposed method integrates an open-weight LLM to systematically extract actors and use cases from software requirements with advanced prompt engineering techniques. The method is evaluated using an exploratory study conducted with five professional software engineers, which compares traditional manual modeling to the proposed LLM-based approach. The results show a substantial acceleration, reducing the modeling time by 60\%. At the same time, the model quality remains on par. Besides improving the modeling efficiency, the participants indicated that the method provided valuable guidance in the process.

CEApr 8

Dead Code Doesn't Talk: Authentic Requirements Elicitation in Introductory Software Engineering

Santiago Berrezueta-Guzman, Vanesa Metaj, Stefan Wagner

Requirements elicitation is among the most communication-intensive activities in software engineering, yet it receives limited explicit treatment in undergraduate curricula. This paper presents a case study of an Introduction to Software Engineering course in which 20 student teams applied requirements elicitation practices to a Java-based 2D game they had built in a prior programming course, engaging 18 campus doctoral and postdoctoral researchers as authentic clients. Structured across four phases--preparation, client meeting, requirements elaboration, and a prototype sprint--the activity produced 203 elicited requirements, SRS documents with a mean quality score of $6.79 \pm 1.08$ out of 10, and prototype demonstrations scoring $7.21 \pm 1.15$. A pre/post self-assessment survey revealed statistically significant improvements across all eight measured soft-skill dimensions, with the largest gains in Stakeholder Empathy ($Î= +1.33$) and Negotiation ($Î= +1.13$). Thematic analysis of reflective reports identified four dominant learning themes, with the tension between client wishes and technical feasibility cited as the most professionally relevant experience. Our findings suggest that anchoring elicitation practice to a student-authored artifact lowers cognitive barriers while increasing authenticity, and that campus researchers serve as an accessible and effective proxy client for programs without established industry partnerships.

CYMay 28, 2025

From Coders to Critics: Empowering Students through Peer Assessment in the Age of AI Copilots

Santiago Berrezueta-Guzman, Stephan Krusche, Stefan Wagner

The rapid adoption of AI powered coding assistants like ChatGPT and other coding copilots is transforming programming education, raising questions about assessment practices, academic integrity, and skill development. As educators seek alternatives to traditional grading methods susceptible to AI enabled plagiarism, structured peer assessment could be a promising strategy. This paper presents an empirical study of a rubric based, anonymized peer review process implemented in a large introductory programming course. Students evaluated each other's final projects (2D game), and their assessments were compared to instructor grades using correlation, mean absolute error, and root mean square error (RMSE). Additionally, reflective surveys from 47 teams captured student perceptions of fairness, grading behavior, and preferences regarding grade aggregation. Results show that peer review can approximate instructor evaluation with moderate accuracy and foster student engagement, evaluative thinking, and interest in providing good feedback to their peers. We discuss these findings for designing scalable, trustworthy peer assessment systems to face the age of AI assisted coding.

SEMar 15, 2024

Large Language Models to Generate System-Level Test Programs Targeting Non-functional Properties

Denis Schwachhofer, Peter Domanski, Steffen Becker et al.

System-Level Test (SLT) has been a part of the test flow for integrated circuits for over a decade and still gains importance. However, no systematic approaches exist for test program generation, especially targeting non-functional properties of the Device under Test (DUT). Currently, test engineers manually compose test suites from off-the-shelf software, approximating the end-user environment of the DUT. This is a challenging and tedious task that does not guarantee sufficient control over non-functional properties. This paper proposes Large Language Models (LLMs) to generate test programs. We take a first glance at how pre-trained LLMs perform in test program generation to optimize non-functional properties of the DUT. Therefore, we write a prompt to generate C code snippets that maximize the instructions per cycle of a super-scalar, out-of-order architecture in simulation. Additionally, we apply prompt and hyperparameter optimization to achieve the best possible results without further training.

HCAug 17, 2025

iTrace: Click-Based Gaze Visualization on the Apple Vision Pro

Esra Mehmedova, Santiago Berrezueta-Guzman, Stefan Wagner

The Apple Vision Pro is equipped with accurate eye-tracking capabilities, yet the privacy restrictions on the device prevent direct access to continuous user gaze data. This study introduces iTrace, a novel application that overcomes these limitations through click-based gaze extraction techniques, including manual methods like a pinch gesture, and automatic approaches utilizing dwell control or a gaming controller. We developed a system with a client-server architecture that captures the gaze coordinates and transforms them into dynamic heatmaps for video and spatial eye tracking. The system can generate individual and averaged heatmaps, enabling analysis of personal and collective attention patterns. To demonstrate its effectiveness and evaluate the usability and performance, a study was conducted with two groups of 10 participants, each testing different clicking methods. The 8BitDo controller achieved higher average data collection rates at 14.22 clicks/s compared to 0.45 clicks/s with dwell control, enabling significantly denser heatmap visualizations. The resulting heatmaps reveal distinct attention patterns, including concentrated focus in lecture videos and broader scanning during problem-solving tasks. By allowing dynamic attention visualization while maintaining a high gaze precision of 91 %, iTrace demonstrates strong potential for a wide range of applications in educational content engagement, environmental design evaluation, marketing analysis, and clinical cognitive assessment. Despite the current gaze data restrictions on the Apple Vision Pro, we encourage developers to use iTrace only in research settings.

SEMay 23, 2025

ReqBrain: Task-Specific Instruction Tuning of LLMs for AI-Assisted Requirements Generation

Mohammad Kasra Habib, Daniel Graziotin, Stefan Wagner

Requirements elicitation and specification remains a labor-intensive, manual process prone to inconsistencies and gaps, presenting a significant challenge in modern software engineering. Emerging studies underscore the potential of employing large language models (LLMs) for automated requirements generation to support requirements elicitation and specification; however, it remains unclear how to implement this effectively. In this work, we introduce ReqBrain, an Al-assisted tool that employs a fine-tuned LLM to generate authentic and adequate software requirements. Software engineers can engage with ReqBrain through chat-based sessions to automatically generate software requirements and categorize them by type. We curated a high-quality dataset of ISO 29148-compliant requirements and fine-tuned five 7B-parameter LLMs to determine the most effective base model for ReqBrain. The top-performing model, Zephyr-7b-beta, achieved 89.30\% Fl using the BERT score and a FRUGAL score of 91.20 in generating authentic and adequate requirements. Human evaluations further confirmed ReqBrain's effectiveness in generating requirements. Our findings suggest that generative Al, when fine-tuned, has the potential to improve requirements elicitation and specification, paving the way for future extensions into areas such as defect identification, test case generation, and agile user story creation.

SEJan 25, 2024

From Requirements to Architecture: An AI-Based Journey to Semi-Automatically Generate Software Architectures

Tobias Eisenreich, Sandro Speth, Stefan Wagner

Designing domain models and software architectures represents a significant challenge in software development, as the resulting architectures play a vital role in fulfilling the system's quality of service. Due to time pressure, architects often model only one architecture based on their known limited domain understanding, patterns, and experience instead of thoroughly analyzing the domain and evaluating multiple candidates, selecting the best fitting. Existing approaches try to generate domain models based on requirements, but still require time-consuming manual effort to achieve good results. Therefore, in this vision paper, we propose a method to generate software architecture candidates semi-automatically based on requirements using artificial intelligence techniques. We further envision an automatic evaluation and trade-off analysis of the generated architecture candidates using, e.g., the architecture trade-off analysis method combined with large language models and quantitative analyses. To evaluate this approach, we aim to analyze the quality of the generated architecture models and the efficiency and effectiveness of our proposed process by conducting qualitative studies.

QUANT-PHOct 27, 2021

Cybersecurity for Quantum Computing

Natalie Kilber, Daniel Kaestle, Stefan Wagner

With rising cyberattack frequency and range, Quantum Computing companies, institutions and research groups may become targets of nation-state actors, cybercriminals and hacktivists for sabotage, espionage and fiscal motivations as the Quantum computing race intensifies. Quantum applications have expanded into commercial, classical information systems and services approaching the necessity to protect their networks, software, hardware and data from digital attacks. This paper discusses the status quo of quantum computing technologies and the quantum threat associated with it. We proceed to outline threat vectors for quantum computing systems and the respective defensive measures, mitigations and best practices to defend against the rapidly evolving threat landscape. We subsequently propose recommendations on how to proactively reduce the cyberattack surface through threat intelligence and by ensuring security by design of quantum software and hardware components.

SESep 28, 2021

Code Comprehension Confounders: A Study of Intelligence and Personal

Stefan Wagner, Marvin Wyrich

Literature and intuition suggest that a developer's intelligence and personality have an impact on their performance in comprehending source code. Researchers made this suggestion in the past when discussing threats to validity of their study results. However, the lack of studies investigating the relationship of intelligence and personality to performance in code comprehension makes scientifically sound reasoning about their influence difficult. We conduct the first empirical evaluation, a correlational study with undergraduates, to investigate the correlation of intelligence and personality with performance in code comprehension, that is with correctness in answering comprehension questions on code snippets. We found that personality traits are unlikely to impact code comprehension performance, at least not considered in isolation. Conscientiousness, in combination with other factors, however, explains some of the variance in code comprehension performance. For intelligence, significant small to moderate positive effects on code comprehension performance were found for three of four factors measured, i.e., fluid intelligence, visual perception, and cognitive speed. Crystallized intelligence has a positive but statistically insignificant effect on code comprehension performance. According to our results, several intelligence facets as well as the personality trait conscientiousness are potential confounders that should not be neglected in code comprehension studies of individual performance and should be controlled for via an appropriate study design. We call for the conduct of further studies on the relationship between intelligence and personality with code comprehension, in part because code comprehension involves more facets than we can measure in a single study and because our regression model explains only a small portion of the variance in code comprehension performance.

LGSep 1, 2021

Optimization Networks for Integrated Machine Learning

Michael Kommenda, Johannes Karder, Andreas Beham et al.

Optimization networks are a new methodology for holistically solving interrelated problems that have been developed with combinatorial optimization problems in mind. In this contribution we revisit the core principles of optimization networks and demonstrate their suitability for solving machine learning problems. We use feature selection in combination with linear model creation as a benchmark application and compare the results of optimization networks to ordinary least squares with optional elastic net regularization. Based on this example we justify the advantages of optimization networks by adapting the network to solve other machine learning problems. Finally, optimization analysis is presented, where optimal input values of a system have to be found to achieve desired output values. Optimization analysis can be divided into three subproblems: model creation to describe the system, model selection to choose the most appropriate one and parameter optimization to obtain the input values. Therefore, optimization networks are an obvious choice for handling optimization analysis tasks.

SEAug 6, 2021

Detecting Requirements Smells With Deep Learning: Experiences, Challenges and Future Work

Mohammad Kasra Habib, Stefan Wagner, Daniel Graziotin

Requirements Engineering (RE) is the initial step towards building a software system. The success or failure of a software project is firmly tied to this phase, based on communication among stakeholders using natural language. The problem with natural language is that it can easily lead to different understandings if it is not expressed precisely by the stakeholders involved, which results in building a product different from the expected one. Previous work proposed to enhance the quality of the software requirements detecting language errors based on ISO 29148 requirements language criteria. The existing solutions apply classical Natural Language Processing (NLP) to detect them. NLP has some limitations, such as domain dependability which results in poor generalization capability. Therefore, this work aims to improve the previous work by creating a manually labeled dataset and using ensemble learning, Deep Learning (DL), and techniques such as word embeddings and transfer learning to overcome the generalization problem that is tied with classical NLP and improve precision and recall metrics using a manually labeled dataset. The current findings show that the dataset is unbalanced and which class examples should be added more. It is tempting to train algorithms even if the dataset is not considerably representative. Whence, the results show that models are overfitting; in Machine Learning this issue is solved by adding more instances to the dataset, improving label quality, removing noise, and reducing the learning algorithms complexity, which is planned for this research.

LGJun 18, 2021

Learning to Plan via a Multi-Step Policy Regression Method

Stefan Wagner, Michael Janschek, Tobias Uelwer et al.

We propose a new approach to increase inference performance in environments that require a specific sequence of actions in order to be solved. This is for example the case for maze environments where ideally an optimal path is determined. Instead of learning a policy for a single step, we want to learn a policy that can predict n actions in advance. Our proposed method called policy horizon regression (PHR) uses knowledge of the environment sampled by A2C to learn an n dimensional policy vector in a policy distillation setup which yields n sequential actions per observation. We test our method on the MiniGrid and Pong environments and show drastic speedup during inference time by successfully predicting sequences of actions on a single observation.

SEMay 5, 2021

Software Engineering for AI-Based Systems: A Survey

Silverio Martínez-Fernández, Justus Bogner, Xavier Franch et al.

AI-based systems are software systems with functionalities enabled by at least one AI component (e.g., for image- and speech-recognition, and autonomous driving). AI-based systems are becoming pervasive in society due to advances in AI. However, there is limited synthesized knowledge on Software Engineering (SE) approaches for building, operating, and maintaining AI-based systems. To collect and analyze state-of-the-art knowledge about SE for AI-based systems, we conducted a systematic mapping study. We considered 248 studies published between January 2010 and March 2020. SE for AI-based systems is an emerging research area, where more than 2/3 of the studies have been published since 2018. The most studied properties of AI-based systems are dependability and safety. We identified multiple SE approaches for AI-based systems, which we classified according to the SWEBOK areas. Studies related to software testing and software quality are very prevalent, while areas like software maintenance seem neglected. Data-related issues are the most recurrent challenges. Our results are valuable for: researchers, to quickly understand the state of the art and learn which topics need more research; practitioners, to learn about the approaches and challenges that SE entails for AI-based systems; and, educators, to bridge the gap among SE and AI in their curricula.

CRMay 1, 2021

A systematic mapping study on security countermeasures of in-vehicle communication systems

Jinghua Yu, Stefan Wagner, Bowen Wang et al.

The innovations of vehicle connectivity have been increasing dramatically to enhance the safety and user experience of driving, while the rising numbers of interfaces to the external world also bring security threats to vehicles. Many security countermeasures have been proposed and discussed to protect the systems and services against attacks. To provide an overview of the current states in this research field, we conducted a systematic mapping study on the topic area "security countermeasures of in-vehicle communication systems". 279 papers are identified based on the defined study identification strategy and criteria. We discussed four research questions related to the security countermeasures, validation methods, publication patterns, and research trends and gaps based on the extracted and classified data. Finally, we evaluated the validity threats, the study identification results, and the whole mapping process. We found that the studies in this topic area are increasing rapidly in recent years. However, there are still gaps in various subtopics like automotive Ethernet security, anomaly reaction, and so on. This study reviews the target field not only related to research findings but also research activities, which can help identify research gaps at a high level and inspire new ideas for future work.

SEMar 15, 2021

Extreme mutation testing in practice: An industrial case study

Maik Betka, Stefan Wagner

Mutation testing is used to evaluate the effectiveness of test suites. In recent years, a promising variation called extreme mutation testing emerged that is computationally less expensive. It identifies methods where their functionality can be entirely removed, and the test suite would not notice it, despite having coverage. These methods are called pseudo-tested. In this paper, we compare the execution and analysis times for traditional and extreme mutation testing and discuss what they mean in practice. We look at how extreme mutation testing impacts current software development practices and discuss open challenges that need to be addressed to foster industry adoption. For that, we conducted an industrial case study consisting of running traditional and extreme mutation testing in a large software project from the semiconductor industry that is covered by a test suite of more than 11,000 unit tests. In addition to that, we did a qualitative analysis of 25 pseudo-tested methods and interviewed two experienced developers to see how they write unit tests and gathered opinions on how useful the findings of extreme mutation testing are. Our results include execution times, scores, numbers of executed tests and mutators, reasons why methods are pseudo-tested, and an interview summary. We conclude that the shorter execution and analysis times are well noticeable in practice and show that extreme mutation testing supplements writing unit tests in conjunction with code coverage tools. We propose that pseudo-tested code should be highlighted in code coverage reports and that extreme mutation testing should be performed when writing unit tests rather than in a decoupled session. Future research should investigate how to perform extreme mutation testing while writing unit tests such that the results are available fast enough but still meaningful.

ARMar 11, 2021

Exploring the Mysteries of System-Level Test

Ilia Polian, Jens Anders, Steffen Becker et al.

System-level test, or SLT, is an increasingly important process step in today's integrated circuit testing flows. Broadly speaking, SLT aims at executing functional workloads in operational modes. In this paper, we consolidate available knowledge about what SLT is precisely and why it is used despite its considerable costs and complexities. We discuss the types or failures covered by SLT, and outline approaches to quality assessment, test generation and root-cause diagnosis in the context of SLT. Observing that the theoretical understanding for all these questions has not yet reached the level of maturity of the more conventional structural and functional test methods, we outline new and promising directions for methodical developments leveraging on recent findings from software engineering.

CRFeb 1, 2021

Using a Cyber Digital Twin for Continuous Automotive Security Requirements Verification

Ana Cristina Franco da Silva, Stefan Wagner, Eddie Lazebnik et al.

A Digital Twin (DT) is a digital representation of a physical object used to simulate it before it is built or to predict failures after the object is deployed. In this article, we introduce our approach, which applies the concept of a Cyber Digital Twin (CDT) to automotive software for the purpose of security analysis. In our approach, automotive firmware is transformed into a CDT, which contains automatically extracted, security-relevant information from the firmware. Based on the CDT, we evaluate security requirements through automated analysis and requirements verification using policy enforcement checks and vulnerabilities detection. The evaluation of a CDT is conducted continuously integrating new checks derived from new security requirements and from newly disclosed vulnerabilities. We applied our approach to about 100 automotive firmwares. In average, about 600 publicly disclosed vulnerabilities and 80 unknown weaknesses were detected per firmware in the pre-production phase. Therefore, the use of a CDT enables efficient continuous verification of security requirements.

SEJan 29, 2021

Résumé-Driven Development: A Definition and Empirical Characterization

Jonas Fritzsch, Marvin Wyrich, Justus Bogner et al.

Technologies play an important role in the hiring process for software professionals. Within this process, several studies revealed misconceptions and bad practices which lead to suboptimal recruitment experiences. In the same context, grey literature anecdotally coined the term Résumé-Driven Development (RDD), a phenomenon describing the overemphasis of trending technologies in both job offerings and resumes as an interaction between employers and applicants. While RDD has been sporadically mentioned in books and online discussions, there are so far no scientific studies on the topic, despite its potential negative consequences. We therefore empirically investigated this phenomenon by surveying 591 software professionals in both hiring (130) and technical (558) roles and identified RDD facets in substantial parts of our sample: 60% of our hiring professionals agreed that trends influence their job offerings, while 82% of our software professionals believed that using trending technologies in their daily work makes them more attractive for prospective employers. Grounded in the survey results, we conceptualize a theory to frame and explain Résumé-Driven Development. Finally, we discuss influencing factors and consequences and propose a definition of the term. Our contribution provides a foundation for future research and raises awareness for a potentially systemic trend that may broadly affect the software industry.

SEJan 27, 2021

Testing in Global Software Development -- A Pattern Approach

Anneke Pehmöller, Frank Salger, Stefan Wagner

Although testing is critical in GSD, its application in this context has not been deeply investigated so far. This work investigates testing in GSD. It provides support for test managers acting in a globally distributed environment. With this it closes a gap. The leading question is "What problems exist in testing in GSD and how can they be addressed in projects?" Decomposing this question we a) identify problems of testing in GSD projects and b) provide good practices to support practitioners in testing in GSD projects. The research is realized in the context of Capgemini Germany. Our contribution to solving the stated research problem is a collection of 16 patterns for testing in GSD projects. For practitioners the usage of the patterns is simplified by various views on the patterns. Herewith we stipulate research and support project managers and test managers in the realization of testing in GSD projects.

SEJan 18, 2021

Formal Verification of a Fail-Operational Automotive Driving System

Tobias Schmid, Stefanie Schraufstetter, Jonas Fritzsch et al.

A fail-operational system for highly automated driving must complete the driving task even in the presence of a failure. This requires redundant architectures and a mechanism to reconfigure the system in case of a failure. Therefore, an arbitration logic is used. For functional safety, the switch-over to a fall-back level must be conducted in the presence of any electric and electronic failure. To provide evidence for a safety argumentation in compliance with ISO 26262, verification of the arbitration logic is necessary. The verification process provides confirmation of the correct failure reactions and that no unintended system states are attainable. Conventional safety analyses, such as the failure mode and effect analysis, have its limits in this regard. We present an analytical approach based on formal verification, in particular model checking, to verify the fail-operational behaviour of a driving system. For that reason, we model the system behaviour and the relevant architecture and formally specify the safety requirements. The scope of the analysis is defined according to the requirements of ISO 26262. We verify a fail-operational arbitration logic for highly automated driving in compliance with the industry standard. Our results show that formal methods for safety evaluation in automotive fail-operational driving systems can be successfully applied. We were able to detect failures, which would have been overlooked by other analyses and thus contribute to the development of safety critical functions.

SEDec 16, 2020

The Mind Is a Powerful Place: How Showing Code Comprehensibility Metrics Influences Code Understanding

Marvin Wyrich, Andreas Preikschat, Daniel Graziotin et al.

Static code analysis tools and integrated development environments present developers with quality-related software metrics, some of which describe the understandability of source code. Software metrics influence overarching strategic decisions that impact the future of companies and the prioritization of everyday software development tasks. Several software metrics, however, lack in validation: we just choose to trust that they reflect what they are supposed to measure. Some of them were even shown to not measure the quality aspects they intend to measure. Yet, they influence us through biases in our cognitive-driven actions. In particular, they might anchor us in our decisions. Whether the anchoring effect exists with software metrics has not been studied yet. We conducted a randomized and double-blind experiment to investigate the extent to which a displayed metric value for source code comprehensibility anchors developers in their subjective rating of source code comprehensibility, whether performance is affected by the anchoring effect when working on comprehension tasks, and which individual characteristics might play a role in the anchoring effect. We found that the displayed value of a comprehensibility metric has a significant and large anchoring effect on a developer's code comprehensibility rating. The effect does not seem to affect the time or correctness when working on comprehension questions related to the code snippets under study. Since the anchoring effect is one of the most robust cognitive biases, and we have limited understanding of the consequences of the demonstrated manipulation of developers by non-validated metrics, we call for an increased awareness of the responsibility in code quality reporting and for corresponding tools to be based on scientific evidence.

SENov 20, 2020

Experiences from Large-Scale Model Checking: Verification of a Vehicle Control System

Jonas Fritzsch, Tobias Schmid, Stefan Wagner

In the age of autonomously driving vehicles, functionality and complexity of embedded systems are increasing tremendously. Safety aspects become more important and require such systems to operate with the highest possible level of fault tolerance. Simulation and systematic testing techniques have reached their limits in this regard. Here, formal verification as a long established technique can be an appropriate complement. However, the necessary preparatory work like adequately modeling a system and specifying properties in temporal logic are anything but trivial. In this paper, we report on our experiences applying model checking to verify the arbitration logic of a Vehicle Control System. We balance pros and cons of different model checking techniques and tools, and reason about our choice of the symbolic model checker NuSMV. We describe the process of modeling the architecture, resulting in ~1500 LOC, 69 state variables and 38 LTL constraints. To handle this large-scale model, we automate and optimize the model checking procedure for use on multi-core CPUs and employ Bounded Model Checking to avoid the state explosion problem. We share our lessons learned and provide valuable insights for architects, developers, and test engineers involved in this highly present topic.