Markus Borg

SE
h-index: 24
45 papers
949 citations
Novelty: 21%
AI Score: 37

45 Papers

SE | Apr 16, 2022 | Code
Ergo, SMIRK is Safe: A Safety Case for a Machine Learning Component in a Pedestrian Automatic Emergency Brake System

Markus Borg, Jens Henriksson, Kasper Socha et al.

Integration of Machine Learning (ML) components in critical applications introduces novel challenges for software certification and verification. New safety standards and technical guidelines are under development to support the safety of ML-based systems, e.g., ISO 21448 SOTIF for the automotive domain and the Assurance of Machine Learning for use in Autonomous Systems (AMLAS) framework. SOTIF and AMLAS provide high-level guidance but the details must be chiseled out for each specific case. We initiated a research project with the goal to demonstrate a complete safety case for an ML component in an open automotive system. This paper reports results from an industry-academia collaboration on safety assurance of SMIRK, an ML-based pedestrian automatic emergency braking demonstrator running in an industry-grade simulator. We demonstrate an application of AMLAS on SMIRK for a minimalistic operational design domain, i.e., we share a complete safety case for its integrated ML-based component. Finally, we report lessons learned and provide both SMIRK and the safety case under an open-source licence for the research community to reuse.

LG | Apr 26, 2022
Performance Analysis of Out-of-Distribution Detection on Trained Neural Networks

Jens Henriksson, Christian Berger, Markus Borg et al.

Deep Learning has improved several areas in recent years. Deep Neural Networks (DNNs) have shown remarkable achievements in non-safety-related applications; however, for using DNNs in safety-critical applications, we are missing approaches for verifying the robustness of such models. A common challenge for DNNs occurs when they are exposed to out-of-distribution samples that are outside the scope of a DNN but that result in high-confidence outputs despite no prior knowledge of such input. In this paper, we analyze three methods that separate in-distribution from out-of-distribution data, called supervisors, on four well-known DNN architectures. We find that the outlier detection performance improves with the quality of the model. We also analyze the performance of the particular supervisors during the training procedure by applying the supervisor at a predefined interval to investigate its performance as the training proceeds. We observe that understanding the relationship between training results and supervisor performance is crucial to improve the model's robustness and to indicate which input samples require further measures to improve the robustness of a DNN. In addition, our work paves the road towards an instrument for safety argumentation for safety-critical applications. This paper is an extended version of our previous work presented at 2019 SEAA (cf. [1]); here, we elaborate on the used metrics, add an additional supervisor, and test them on two additional datasets.

SE | Mar 10, 2023
Automotive Perception Software Development: An Empirical Investigation into Data, Annotation, and Ecosystem Challenges

Hans-Martin Heyn, Khan Mohammad Habibullah, Eric Knauss et al.

Software that contains machine learning algorithms is an integral part of automotive perception, for example, in driving automation systems. The development of such software, specifically the training and validation of the machine learning components, requires large annotated datasets. An industry of data and annotation services has emerged to serve the development of such data-intensive automotive software components. Widespread difficulties in specifying data and annotation needs challenge collaborations between OEMs (Original Equipment Manufacturers) and their suppliers of software components, data, and annotations. This paper investigates why practitioners in the Swedish automotive industry find it difficult to arrive at clear specifications for data and annotations. The results from an interview study show that a lack of effective metrics for data quality aspects, ambiguities in the way of working, unclear definitions of annotation quality, and deficits in the business ecosystems are causes for the difficulty in deriving the specifications. We provide a list of recommendations that can mitigate challenges when deriving specifications and we propose future research opportunities to overcome these challenges. Our work contributes towards the ongoing research on accountability of machine learning as applied to complex software systems, especially for high-stakes applications such as automated driving.

SE | Mar 22, 2022
Machine Learning Testing in an ADAS Case Study Using Simulation-Integrated Bio-Inspired Search-Based Testing

Mahshid Helali Moghadam, Markus Borg, Mehrdad Saadatmand et al.

This paper presents an extended version of Deeper, a search-based simulation-integrated test solution that generates failure-revealing test scenarios for testing a deep neural network-based lane-keeping system. In the newly proposed version, we utilize a new set of bio-inspired search algorithms, genetic algorithm (GA), $(μ+λ)$ and $(μ,λ)$ evolution strategies (ES), and particle swarm optimization (PSO), that leverage a quality population seed and domain-specific cross-over and mutation operations tailored for the presentation model used for modeling the test scenarios. In order to demonstrate the capabilities of the new test generators within Deeper, we carry out an empirical evaluation and comparison with regard to the results of five participating tools in the cyber-physical systems testing competition at SBST 2021. Our evaluation shows the newly proposed test generators in Deeper not only represent a considerable improvement on the previous version but also prove to be effective and efficient in provoking a considerable number of diverse failure-revealing test scenarios for testing an ML-driven lane-keeping system. They can trigger several failures while promoting test scenario diversity, under a limited test time budget, high target failure severity, and strict speed limit constraints.
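The two evolution strategies named in the abstract differ only in survivor selection: $(μ+λ)$ picks the next generation from parents and offspring together, while $(μ,λ)$ picks from the offspring alone. The following is a minimal sketch of that difference on a toy objective (the sphere function). It does not reproduce Deeper's scenario representation, seeding, or domain-specific operators; every function name and parameter value is a hypothetical choice made for illustration.

```python
import random

def sphere(x):
    """Toy objective to minimise; stands in for a real test-scenario fitness."""
    return sum(v * v for v in x)

def evolve(mu, lam, plus, generations=50, dim=3, sigma=0.3, seed=0):
    """Run a (mu+lam) ES if plus is True, otherwise a (mu,lam) ES."""
    rng = random.Random(seed)
    parents = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(mu)]
    history = [min(sphere(p) for p in parents)]  # best fitness per generation
    for _ in range(generations):
        # Each offspring is a Gaussian mutation of a randomly chosen parent.
        offspring = []
        for _ in range(lam):
            base = rng.choice(parents)
            offspring.append([v + rng.gauss(0, sigma) for v in base])
        # Survivor selection is the only difference between the two variants.
        pool = parents + offspring if plus else offspring
        parents = sorted(pool, key=sphere)[:mu]
        history.append(sphere(parents[0]))
    return parents[0], history

best_plus, hist_plus = evolve(mu=5, lam=20, plus=True)
best_comma, hist_comma = evolve(mu=5, lam=20, plus=False)
print("plus :", hist_plus[0], "->", hist_plus[-1])
print("comma:", hist_comma[0], "->", hist_comma[-1])
```

Because the plus variant keeps parents in the selection pool, its best fitness can never get worse from one generation to the next; the comma variant gives up that guarantee in exchange for forgetting stale parents.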

SE | Mar 29, 2022
Quality Assurance of Generative Dialog Models in an Evolving Conversational Agent Used for Swedish Language Practice

Markus Borg, Johan Bengtsson, Harald Österling et al.

Due to the migration megatrend, efficient and effective second-language acquisition is vital. One proposed solution involves AI-enabled conversational agents for person-centered interactive language practice. We present results from ongoing action research targeting quality assurance of proprietary generative dialog models trained for virtual job interviews. The action team elicited a set of 38 requirements, and we designed automated test cases for the 15 of particular interest to the evolving solution. Our results show that six of the test case designs can detect meaningful differences between candidate models. While quality assurance of natural language processing applications is complex, we provide initial steps toward an automated framework for machine learning model selection in the context of an evolving conversational agent. Future work will focus on model selection in an MLOps setting.

CL | Oct 13, 2022
Automotive Multilingual Fault Diagnosis

John Pavlopoulos, Alv Romell, Jacob Curman et al.

Automated fault diagnosis can facilitate diagnostics assistance, speedier troubleshooting, and better-organised logistics. Currently, AI-based prognostics and health management in the automotive industry ignore the textual descriptions of the experienced problems or symptoms. With this study, however, we show that a multilingual pre-trained Transformer can effectively classify the textual claims from a large company with vehicle fleets, despite the task's challenging nature due to the 38 languages and 1,357 classes involved. Overall, we report an accuracy of more than 80% for high-frequency classes and above 60% for above-low-frequency classes, bringing novel evidence that multilingual classification can benefit automotive troubleshooting management.

SE | Jan 5
Code for Machines, Not Just Humans: Quantifying AI-Friendliness with Code Health Metrics

Markus Borg, Nadim Hagatulah, Adam Tornhill et al.

We are entering a hybrid era in which human developers and AI coding agents work in the same codebases. While industry practice has long optimized code for human comprehension, it is increasingly important to ensure that LLMs with different capabilities can edit code reliably. In this study, we investigate the concept of "AI-friendly code" via LLM-based refactoring on a dataset of 5,000 Python files from competitive programming. We find a meaningful association between CodeHealth, a quality metric calibrated for human comprehension, and semantic preservation after AI refactoring. Our findings confirm that human-friendly code is also more compatible with AI tooling. These results suggest that organizations can use CodeHealth to guide where AI interventions are lower risk and where additional human oversight is warranted. Investing in maintainability not only helps humans; it also prepares for large-scale AI adoption.

LG | Nov 16, 2021 | Code
Machine Learning-Assisted Analysis of Small Angle X-ray Scattering

Piotr Tomaszewski, Shun Yu, Markus Borg et al.

Small angle X-ray scattering (SAXS) is extensively used in materials science as a way of examining nanostructures. The analysis of experimental SAXS data involves mapping a rather simple data format to a vast number of structural models. Despite various scientific computing tools to assist the model selection, the activity heavily relies on the SAXS analysts' experience, which is recognized as an efficiency bottleneck by the community. To cope with this decision-making problem, we develop and evaluate the open-source, Machine Learning-based tool SCAN (SCattering Ai aNalysis) to provide recommendations on model selection. SCAN exploits multiple machine learning algorithms and uses models and a simulation tool implemented in the SasView package for generating a well-defined set of datasets. Our evaluation shows that SCAN delivers an overall accuracy of 95%-97%. The XGBoost Classifier has been identified as the most accurate method with a good balance between accuracy and training time. With eleven predefined structural models for common nanostructures and an easy drag-and-drop function to expand the number and types of training models, SCAN can accelerate the SAXS data analysis workflow.

SE | Jun 11, 2019 | Code
Sharing of vulnerability information among companies -- a survey of Swedish companies

Thomas Olsson, Martin Hell, Martin Höst et al.

Software products are rarely developed from scratch and vulnerabilities in such products might reside in parts that are either open source software or provided by another organization. Hence, the total cybersecurity of a product often depends on cooperation, explicit or implicit, between several organizations. We study the attitudes and practices of companies in software ecosystems towards sharing vulnerability information. Furthermore, we compare these practices to contemporary cybersecurity recommendations. This is performed through a questionnaire-based qualitative survey. The questionnaire is divided into two parts: the providers' perspective and the acquirers' perspective. The results show that companies are willing to share information with each other regarding vulnerabilities. Sharing is not considered harmful to either their cybersecurity or their business, even though a majority of the respondents consider vulnerability information sensitive. However, the companies, despite being open to sharing, are less inclined to proactively share vulnerability information. Furthermore, the providers do not perceive that there is a large interest in vulnerability information from their customers. Hence, the companies' overall attitude to sharing vulnerability information is passive but open. In contrast, contemporary cybersecurity guidelines recommend active disclosure and sharing among actors in an ecosystem.

SE | Jul 28, 2018 | Code
Goal-Oriented Mutation Testing with Focal Methods

Sten Vercammen, Mohammad Ghafari, Serge Demeyer et al.

Mutation testing is the state-of-the-art technique for assessing the fault-detection capacity of a test suite. Unfortunately, mutation testing consumes enormous computing resources because it runs the whole test suite for each and every injected mutant. In this paper we explore fine-grained traceability links at method level (named focal methods), to reduce the execution time of mutation testing and to verify the quality of the test cases for each individual method, instead of the usually verified overall test suite quality. Validation of our approach on the open source Apache Ant project shows a speed-up of 573.5x for the mutants located in focal methods with a quality score of 80%.
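The cost argument in the abstract can be illustrated with a small sketch: if each test is linked to the method it primarily exercises, a mutant only needs the tests linked to the mutated method rather than the whole suite. The code below is an illustration under assumptions, not the authors' tooling. The link table, the test and method names, and the fallback to the full suite when no link exists are all hypothetical.

```python
def select_tests(mutated_method, focal_links, all_tests):
    """Return tests whose focal method is the mutated one; fall back to the full suite."""
    selected = [t for t in all_tests if focal_links.get(t) == mutated_method]
    return selected if selected else list(all_tests)

def count_executions(mutants, focal_links, all_tests):
    """Compare test executions for full-suite versus focal-method mutation testing."""
    full = len(mutants) * len(all_tests)
    focal = sum(len(select_tests(m, focal_links, all_tests)) for m in mutants)
    return full, focal

# Hypothetical traceability links: test name -> method it primarily exercises.
focal_links = {
    "test_push": "Stack.push",
    "test_pop": "Stack.pop",
    "test_pop_empty": "Stack.pop",
    "test_peek": "Stack.peek",
}
all_tests = list(focal_links)
# One entry per injected mutant, naming the method that contains it.
mutants = ["Stack.push", "Stack.pop", "Stack.pop", "Stack.peek"]

full, focal = count_executions(mutants, focal_links, all_tests)
print(f"full suite: {full} runs, focal only: {focal} runs")
```

The saving grows with the size of the suite, since the full-suite cost scales with mutants times tests while the focal cost scales with the number of tests linked to each mutated method.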

LG | Jan 30, 2024
Evaluation of Out-of-Distribution Detection Performance on Autonomous Driving Datasets

Jens Henriksson, Christian Berger, Stig Ursing et al.

Safety measures need to be systematically investigated to determine to what extent they evaluate the intended performance of Deep Neural Networks (DNNs) for critical applications. Due to a lack of verification methods for high-dimensional DNNs, a trade-off is needed between accepted performance and handling of out-of-distribution (OOD) samples. This work evaluates rejecting outputs from semantic segmentation DNNs by applying a Mahalanobis distance (MD) based on the most probable class-conditional Gaussian distribution for the predicted class as an OOD score. The evaluation follows three DNNs trained on the Cityscapes dataset and tested on four automotive datasets and finds that classification risk can be drastically reduced at the cost of pixel coverage, even when applied on unseen datasets. The applicability of our findings will support legitimizing safety measures and motivate their usage when arguing for safe usage of DNNs in automotive perception.
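As a rough illustration of the rejection idea in this abstract, the sketch below fits one Gaussian per class on feature vectors and uses the Mahalanobis distance to the predicted class's Gaussian as an OOD score. It is not the paper's implementation: it assumes a diagonal covariance to stay dependency-free, works on tiny hand-made 2-D features rather than segmentation features, and the class names and threshold are hypothetical.

```python
import math

def fit_class_gaussians(features, labels, eps=1e-6):
    """Estimate a per-class mean and per-dimension variance (diagonal covariance)."""
    grouped = {}
    for x, y in zip(features, labels):
        grouped.setdefault(y, []).append(x)
    stats = {}
    for y, xs in grouped.items():
        dim = len(xs[0])
        mean = [sum(x[d] for x in xs) / len(xs) for d in range(dim)]
        # eps keeps the variance strictly positive for degenerate dimensions.
        var = [sum((x[d] - mean[d]) ** 2 for x in xs) / len(xs) + eps
               for d in range(dim)]
        stats[y] = (mean, var)
    return stats

def mahalanobis(x, mean, var):
    """Mahalanobis distance under a diagonal covariance (simplifying assumption)."""
    return math.sqrt(sum((xi - m) ** 2 / v for xi, m, v in zip(x, mean, var)))

def ood_score(x, predicted_class, stats):
    """Distance of x to the Gaussian of the class the model predicted."""
    mean, var = stats[predicted_class]
    return mahalanobis(x, mean, var)

def accept(x, predicted_class, stats, threshold):
    """Keep the prediction only if the score is within the (hypothetical) threshold."""
    return ood_score(x, predicted_class, stats) <= threshold

# Hypothetical training features for two classes.
features = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0),
            (5.0, 5.0), (6.0, 5.0), (5.0, 6.0), (6.0, 6.0)]
labels = ["road"] * 4 + ["car"] * 4
stats = fit_class_gaussians(features, labels)

print(accept((0.5, 0.5), "road", stats, threshold=3.0))  # near the "road" mean
print(accept((5.5, 5.5), "road", stats, threshold=3.0))  # far from the "road" mean
```

Raising the threshold accepts more predictions (higher coverage) at the price of accepting riskier ones, which is the trade-off the abstract describes.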

SE | Dec 20, 2024
Trust Calibration in IDEs: Paving the Way for Widespread Adoption of AI Refactoring

Markus Borg

In the software industry, the drive to add new features often overshadows the need to improve existing code. Large Language Models (LLMs) offer a new approach to improving codebases at an unprecedented scale through AI-assisted refactoring. However, LLMs come with inherent risks such as breaking changes and the introduction of security vulnerabilities. We advocate for encapsulating the interaction with the models in IDEs and validating refactoring attempts using trustworthy safeguards. However, equally important for the uptake of AI refactoring is research on trust development. In this position paper, we position our future work based on established models from research on human factors in automation. We outline action research within CodeScene on development of 1) novel LLM safeguards and 2) user interaction that conveys an appropriate level of trust. The industry collaboration enables large-scale repository analysis and A/B testing to continuously guide the design of our research interventions.

SE | Jul 1, 2025
Echoes of AI: Investigating the Downstream Effects of AI Assistants on Software Maintainability

Markus Borg, Dave Hewett, Nadim Hagatulah et al.

[Context] AI assistants, like GitHub Copilot and Cursor, are transforming software engineering. While several studies highlight productivity improvements, their impact on maintainability requires further investigation. [Objective] This study investigates whether co-development with AI assistants affects software maintainability, specifically how easily other developers can evolve the resulting source code. [Method] We conducted a two-phase controlled experiment involving 151 participants, 95% of whom were professional developers. In Phase 1, participants added a new feature to a Java web application, with or without AI assistance. In Phase 2, a randomized controlled trial, new participants evolved these solutions without AI assistance. [Results] AI-assisted development in Phase 1 led to a modest speedup in subsequent evolution and slightly higher average CodeHealth. Although neither difference was significant overall, the increase in CodeHealth was statistically significant when habitual AI users completed Phase 1. For Phase 1, we also observed a significant effect that corroborates previous productivity findings: using an AI assistant yielded a 30.7% median decrease in task completion time. Moreover, for habitual AI users, the mean speedup was 55.9%. [Conclusions] Our study adds to the growing evidence that AI assistants can effectively accelerate development. Moreover, we did not observe warning signs of degraded code-level maintainability. We recommend that future research focus on risks such as code bloat from excessive code generation and the build-up of cognitive debt as developers invest less mental effort during implementation.

SE | Mar 30, 2022
Exploring ML testing in practice -- Lessons learned from an interactive rapid review with Axis Communications

Qunying Song, Markus Borg, Emelie Engström et al.

There is a growing interest in industry and academia in machine learning (ML) testing. We believe that industry and academia need to learn together to produce rigorous and relevant knowledge. In this study, we initiate a collaboration between stakeholders from one case company, one research institute, and one university. To establish a common view of the problem domain, we applied an interactive rapid review of the state of the art. Four researchers from Lund University and RISE Research Institutes and four practitioners from Axis Communications reviewed a set of 180 primary studies on ML testing. We developed a taxonomy for the communication around ML testing challenges and results and identified a list of 12 review questions relevant for Axis Communications. The three most important questions (data testing, metrics for assessment, and test generation) were mapped to the literature, and an in-depth analysis of the 35 primary studies matching the most important question (data testing) was made. A final set of the five best matches was analysed, and we reflect on the criteria for applicability and relevance for the industry. The taxonomies are helpful for communication but not final. Furthermore, there was no perfect match to the case company's investigated review question (data testing). However, we extracted relevant approaches from the five studies on a conceptual level to support later context-specific improvements. We found the interactive rapid review approach useful for triggering and aligning communication between the different stakeholders.

SE | Nov 28, 2021
Agility in Software 2.0 -- Notebook Interfaces and MLOps with Buttresses and Rebars

Markus Borg

Artificial intelligence through machine learning is increasingly used in the digital society. Solutions based on machine learning bring both great opportunities, thus coined "Software 2.0," but also great challenges for the engineering community to tackle. Due to the experimental approach used by data scientists when developing machine learning models, agility is an essential characteristic. In this keynote address, we discuss two contemporary development phenomena that are fundamental in machine learning development, i.e., notebook interfaces and MLOps. First, we present a solution that can remedy some of the intrinsic weaknesses of working in notebooks by supporting easy transitions to integrated development environments. Second, we propose reinforced engineering of AI systems by introducing metaphorical buttresses and rebars in the MLOps context. Machine learning-based solutions are dynamic in nature, and we argue that reinforced continuous engineering is required to quality assure the trustworthy AI systems of tomorrow.

SE | Sep 28, 2021
Adopting Automated Bug Assignment in Practice -- A Registered Report of an Industrial Case Study

Markus Borg, Leif Jonsson, Emelie Engström et al.

[Background/Context] The continuous inflow of bug reports is a considerable challenge in large development projects. Inspired by contemporary work on mining software repositories, we designed a prototype bug assignment solution based on machine learning in 2011-2016. The prototype evolved into an internal Ericsson product, TRR, in 2017-2018. TRR's first bug assignment without human intervention happened in 2019. [Objective/Aim] Our exploratory study will evaluate the adoption of TRR within its industrial context at Ericsson. We seek to understand 1) how TRR performs in the field, 2) what value TRR provides to Ericsson, and 3) how TRR has influenced the ways of working. Secondly, we will provide lessons learned related to productization of a research prototype within a company. [Method] We design an industrial case study combining interviews with TRR developers and users with analysis of data extracted from the bug tracking system at Ericsson. Furthermore, we will analyze sprint planning meetings recorded during the productization. Our data analysis will include thematic analysis, descriptive statistics, and Bayesian causal analysis.

AI | Sep 16, 2021
Efficient and Effective Generation of Test Cases for Pedestrian Detection -- Search-based Software Testing of Baidu Apollo in SVL

Hamid Ebadi, Mahshid Helali Moghadam, Markus Borg et al.

With the growing capabilities of autonomous vehicles, there is a higher demand for sophisticated and pragmatic quality assurance approaches for machine learning-enabled systems in the automotive AI context. The use of simulation-based prototyping platforms provides the possibility for early-stage testing, enabling inexpensive testing and the ability to capture critical corner-case test scenarios. Simulation-based testing properly complements conventional on-road testing. However, due to the large space of test input parameters in these systems, the efficient generation of effective test scenarios leading to the unveiling of failures is a challenge. This paper presents a study on testing the pedestrian detection and emergency braking system of the Baidu Apollo autonomous driving platform within the SVL simulator. We propose an evolutionary automated test generation technique that generates failure-revealing scenarios for Apollo in the SVL environment. Our approach models the input space using a generic and flexible data structure and benefits from a multi-criteria safety-based heuristic for the objective function targeted for optimization. This paper presents the results of our proposed test generation technique in the 2021 IEEE Autonomous Driving AI Test Challenge. In order to demonstrate the efficiency and effectiveness of our approach, we also report the results from a baseline random generation technique. Our evaluation shows that the proposed evolutionary test case generator is more effective at generating failure-revealing test cases and provides higher diversity between the generated failures than the random baseline.

SE | Apr 28, 2021
Challenges of Adopting SAFe in the Banking Industry -- A Study Two Years after its Introduction

Sara Nilsson Tengstrand, Piotr Tomaszewski, Markus Borg et al.

The Scaled Agile Framework (SAFe) is a framework for scaling agile methods in large organizations. We have found several experience reports and white papers describing SAFe adoptions in different banks, which indicates that SAFe is being used in the banking industry. However, there is a lack of academic publications on the topic; the banking industry is missing from the scientific reports analyzing SAFe transformations. To fill this gap, we present a study on the main challenges with a SAFe transformation at a large full-service bank. We identify the challenges in the bank under study and compare the findings with experience reports from other banks, as well as with research on SAFe transformations in other domains. Many of the challenges reported in this paper overlap with the generic SAFe challenges, including management and organization, education and training, culture and mindset, requirements engineering, quality assurance, and systems architecture. However, we also report some novel challenges specific to the banking domain, e.g., the risk of jeopardizing customer relations, stability, and trust of external stakeholders. This study validates several SAFe-related challenges reported in previous work in the banking context. It also brings up some novel challenges specific to the banking industry. Therefore, we believe our results are particularly useful to practitioners responsible for SAFe transformations at other banks.

SE | Apr 26, 2021
Performance Testing Using a Smart Reinforcement Learning-Driven Test Agent

Mahshid Helali Moghadam, Golrokh Hamidi, Markus Borg et al.

Performance testing with the aim of generating an efficient and effective workload to identify performance issues is challenging. Many of the automated approaches mainly rely on analyzing system models, source code, or extracting the usage pattern of the system during the execution. However, such information and artifacts are not always available. Moreover, the transactions within a generated workload do not all impact the performance of the system in the same way; a finely tuned workload could accomplish the test objective efficiently. Model-free reinforcement learning is widely used for finding the optimal behavior to accomplish an objective in many decision-making problems without relying on a model of the system. This paper proposes that if the optimal policy (way) for generating test workload to meet a test objective can be learned by a test agent, then efficient test automation would be possible without relying on system models or source code. We present a self-adaptive reinforcement learning-driven load testing agent, RELOAD, that learns the optimal policy for test workload generation and generates an effective workload efficiently to meet the test objective. Once the agent learns the optimal policy, it can reuse the learned policy in subsequent testing activities. Our experiments show that the proposed intelligent load test agent can accomplish the test objective with lower test cost compared to common load testing procedures, and results in higher test efficiency.
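To make the idea of learning a workload policy without a system model concrete, here is a minimal tabular Q-learning sketch. It is not RELOAD: the environment is a hypothetical toy in which the state is a discrete load level, the agent may lower, keep, or raise the load, and the test objective is assumed to be met at the highest level. The reward values and hyperparameters are illustrative choices.

```python
import random

LEVELS = 5            # load levels 0..4; level 4 is the hypothetical test objective
ACTIONS = (-1, 0, 1)  # decrease, keep, or increase the load level

def step(state, action):
    """Toy deterministic environment: returns (next_state, reward, done)."""
    nxt = min(max(state + action, 0), LEVELS - 1)
    done = nxt == LEVELS - 1
    # Every step costs 1 (test cost); reaching the objective pays 10.
    return nxt, (10.0 if done else -1.0), done

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(LEVELS) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        for _ in range(50):  # step cap per episode
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            nxt, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(nxt, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = nxt
            if done:
                break
    return q

def greedy_path(q, max_steps=20):
    """Follow the learned policy greedily from load level 0."""
    state, path = 0, [0]
    for _ in range(max_steps):
        action = max(ACTIONS, key=lambda a: q[(state, a)])
        state, _, done = step(state, action)
        path.append(state)
        if done:
            break
    return path

q_table = train()
print(greedy_path(q_table))
```

Once trained, the Q-table can be reused for later runs without further exploration, which mirrors the policy reuse mentioned in the abstract.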

LG | Mar 29, 2021
Performance Analysis of Out-of-Distribution Detection on Various Trained Neural Networks

Jens Henriksson, Christian Berger, Markus Borg et al.

Deep Learning has improved several areas in recent years. For non-safety-related products, adoption of AI and ML is not an issue, whereas in safety-critical applications, the robustness of such approaches is still an issue. A common challenge for Deep Neural Networks (DNNs) occurs when they are exposed to out-of-distribution samples that are previously unseen, where DNNs can yield high-confidence predictions despite no prior knowledge of the input. In this paper, we analyse two supervisors on two well-known DNNs with varied training setups and find that the outlier detection performance improves with the quality of the training procedure. We analyse the performance of the supervisor after each epoch during the training cycle, to investigate supervisor performance as the accuracy converges. Understanding the relationship between training results and supervisor performance is valuable to improve robustness of the model and indicates where more work has to be done to create generalized models for safety-critical applications.

CY | Mar 4, 2021
Exploring the Assessment List for Trustworthy AI in the Context of Advanced Driver-Assistance Systems

Markus Borg, Joshua Bronson, Linus Christensson et al.

Artificial Intelligence (AI) is increasingly used in critical applications. Thus, the need for dependable AI systems is rapidly growing. In 2018, the European Commission appointed experts to a High-Level Expert Group on AI (AI-HLEG). AI-HLEG defined Trustworthy AI as 1) lawful, 2) ethical, and 3) robust and specified seven corresponding key requirements. To help development organizations, AI-HLEG recently published the Assessment List for Trustworthy AI (ALTAI). We present an illustrative case study from applying ALTAI to an ongoing development project of an Advanced Driver-Assistance System (ADAS) that relies on Machine Learning (ML). Our experience shows that ALTAI is largely applicable to ADAS development, but specific parts related to human agency and transparency can be disregarded. Moreover, bigger questions related to societal and environmental impact cannot be tackled by an ADAS supplier in isolation. We present how we plan to develop the ADAS to ensure ALTAI-compliance. Finally, we provide three recommendations for the next revision of ALTAI, i.e., life-cycle variants, domain-specific adaptations, and removed redundancy.

SE | Mar 2, 2021
Test Automation with Grad-CAM Heatmaps -- A Future Pipe Segment in MLOps for Vision AI?

Markus Borg, Ronald Jabangwe, Simon Åberg et al.

Machine Learning (ML) is a fundamental part of modern perception systems. In the last decade, computer vision using trained deep neural networks has outperformed previous approaches based on careful feature engineering. However, the opaqueness of large ML models is a substantial impediment for critical applications such as in the automotive context. As a remedy, Gradient-weighted Class Activation Mapping (Grad-CAM) has been proposed to provide visual explanations of model internals. In this paper, we demonstrate how Grad-CAM heatmaps can be used to increase the explainability of an image recognition model trained for a pedestrian underpass. We argue how the heatmaps support compliance with the EU's seven key requirements for Trustworthy AI. Finally, we propose adding automated heatmap analysis as a pipe segment in an MLOps pipeline. We believe that such a building block can be used to automatically detect if a trained ML model is activated based on invalid pixels in test images, suggesting biased models.
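One way such an automated heatmap check could look is sketched below: given a heatmap and a mask of pixels considered valid evidence, it measures how much of the activation falls on invalid pixels and flags the image if that share is too high. This is an assumption-laden illustration, not the paper's pipeline segment. The heatmap is assumed to be already computed (Grad-CAM itself is not implemented here), and the mask, the function names, and the 0.3 threshold are hypothetical.

```python
def invalid_activation_ratio(heatmap, valid_mask):
    """Share of total heatmap activation that lies on pixels marked invalid."""
    total = 0.0
    invalid = 0.0
    for heat_row, mask_row in zip(heatmap, valid_mask):
        for heat, is_valid in zip(heat_row, mask_row):
            total += heat
            if not is_valid:
                invalid += heat
    return invalid / total if total > 0 else 0.0

def flag_suspicious(heatmap, valid_mask, max_ratio=0.3):
    """Flag an image when too much activation sits outside the valid region."""
    return invalid_activation_ratio(heatmap, valid_mask) > max_ratio

# Hypothetical 3x3 example: only the middle row counts as valid evidence.
valid_mask = [
    [False, False, False],
    [True, True, True],
    [False, False, False],
]
focused = [
    [0.0, 0.1, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.1, 0.0],
]
biased = [
    [0.9, 0.0, 0.0],
    [0.0, 0.1, 0.0],
    [0.0, 0.0, 0.0],
]

print(flag_suspicious(focused, valid_mask))  # activation mostly on valid pixels
print(flag_suspicious(biased, valid_mask))   # activation mostly on a corner pixel
```

In a pipeline, a check like this would run over a batch of test images, and a high share of flagged images would be treated as a signal to inspect the model for bias.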

SE | Dec 12, 2020
Digital Twins Are Not Monozygotic -- Cross-Replicating ADAS Testing in Two Industry-Grade Automotive Simulators

Markus Borg, Raja Ben Abdessalem, Shiva Nejati et al.

The increasing levels of software- and data-intensive driving automation call for an evolution of automotive software testing. As a recommended practice of the Verification and Validation (V&V) process of ISO/PAS 21448, a candidate standard for safety of the intended functionality for road vehicles, simulation-based testing has the potential to reduce both risks and costs. There is a growing body of research on devising test automation techniques using simulators for Advanced Driver-Assistance Systems (ADAS). However, how similar are the results if the same test scenarios are executed in different simulators? We conduct a replication study of applying a Search-Based Software Testing (SBST) solution to a real-world ADAS (PeVi, a pedestrian vision detection system) using two different commercial simulators, namely, TASS/Siemens PreScan and ESI Pro-SiVIC. Based on a minimalistic scene, we compare critical test scenarios generated using our SBST solution in these two simulators. We show that SBST can be used to effectively and efficiently generate critical test scenarios in both simulators, and the test results obtained from the two simulators can reveal several weaknesses of the ADAS under test. However, executing the same test scenarios in the two simulators leads to notable differences in the details of the test outputs, in particular, related to (1) safety violations revealed by tests, and (2) dynamics of cars and pedestrians. Based on our findings, we recommend future V&V plans to include multiple simulators to support robust simulation-based testing and to base test objectives on measures that are less dependent on the internals of the simulators.

CVSep 11, 2020
Enabling Image Recognition on Constrained Devices Using Neural Network Pruning and a CycleGAN

August Lidfelt, Daniel Isaksson, Ludwig Hedlund et al.

Smart cameras are increasingly used in surveillance solutions in public spaces. Contemporary computer vision applications can be used to recognize events that require intervention by emergency services. Smart cameras can be mounted in locations where citizens feel particularly unsafe, e.g., pathways and underpasses with a history of incidents. One promising approach for smart cameras is edge AI, i.e., deploying AI technology on IoT devices. However, implementing resource-demanding technology such as image recognition using deep neural networks (DNN) on constrained devices is a substantial challenge. In this paper, we explore two approaches to reduce the need for compute in contemporary image recognition in an underpass. First, we showcase successful neural network pruning, i.e., we retain comparable classification accuracy with only 1.1% of the neurons remaining from the state-of-the-art DNN architecture. Second, we demonstrate how a CycleGAN can be used to transform out-of-distribution images to the operational design domain. We posit that both pruning and CycleGANs are promising enablers for efficient edge AI in smart cameras.

SESep 11, 2020
The AIQ Meta-Testbed: Pragmatically Bridging Academic AI Testing and Industrial Q Needs

Markus Borg

AI solutions seem to appear in any and all application domains. As AI becomes more pervasive, the importance of quality assurance increases. Unfortunately, there is no consensus on what artificial intelligence means and interpretations range from simple statistical analysis to sentient humanoid robots. On top of that, quality is a notoriously hard concept to pinpoint. What does this mean for AI quality? In this paper, we share our working definition and a pragmatic approach to address the corresponding quality assurance with a focus on testing. Finally, we present our ongoing work on establishing the AIQ Meta-Testbed.

SEMay 27, 2020
Making Lab Sessions Mandatory -- On Student Work Distribution in a Gamified Project Course on Market-Driven Software Engineering

Markus Borg

Unfair work distribution in student teams is a common issue in project-based learning. One contributing factor is that students are differently skilled developers. In a course with group work intertwining engineering and business aspects, we designed an intervention to help novice programmers, i.e., we introduced mandatory programming lab sessions. However, the intervention did not affect the work distribution, showing that more is needed to balance the workload. Contrary to our goal, the intervention was very well received among experienced students, but unpopular with students weak at programming.

CYMay 26, 2020
Illuminating a Blind Spot in Digitalization -- Software Development in Sweden's Private and Public Sector

Markus Borg, Joakim Wernberg, Thomas Olsson et al.

As Netscape co-founder Marc Andreessen famously remarked in 2011, software is eating the world - becoming a pervasive invisible critical infrastructure. Data on the distribution of software use and development in society is scarce, but we compile results from two novel surveys to provide a fuller picture of the role software plays in the public and private sectors in Sweden, respectively. Three out of ten Swedish firms, across industry sectors, develop software in-house. The corresponding figure for Sweden's government agencies is four out of ten, i.e., the public sector should not be underestimated. The digitalization of society will continue, thus the demand for software developers will further increase. Many private firms report that the limited supply of software developers in Sweden is directly affecting their expansion plans. Based on our findings, we outline directions that need additional research to allow evidence-informed policy-making. We argue that such work should ideally be conducted by academic researchers and national statistics agencies in collaboration.

SEAug 19, 2019
An Autonomous Performance Testing Framework using Self-Adaptive Fuzzy Reinforcement Learning

Mahshid Helali Moghadam, Mehrdad Saadatmand, Markus Borg et al.

Test automation brings the potential to reduce costs and human effort, but several aspects of software testing remain challenging to automate. One such example is automated performance testing to find performance breaking points. Current approaches to tackle automated generation of performance test cases mainly involve using source code or system model analysis or use-case based techniques. However, source code and system models might not always be available at testing time. On the other hand, if the optimal performance testing policy for the intended objective in a testing process instead could be learned by the testing system, then test automation without advanced performance models could be possible. Furthermore, the learned policy could later be reused for similar software systems under test, thus leading to higher test efficiency. We propose SaFReL, a self-adaptive fuzzy reinforcement learning-based performance testing framework. SaFReL learns the optimal policy to generate performance test cases through an initial learning phase, then reuses it during a transfer learning phase, while keeping the learning running and updating the policy in the long term. Through multiple experiments on a simulated environment, we demonstrate that our approach generates the target performance test cases for different programs more efficiently than a typical testing process, and performs adaptively without access to source code and performance models.

LGAug 13, 2019
Requirements Engineering for Machine Learning: Perspectives from Data Scientists

Andreas Vogelsang, Markus Borg

Machine learning (ML) is used increasingly in real-world applications. In this paper, we describe our ongoing endeavor to define characteristics and challenges unique to Requirements Engineering (RE) for ML-based systems. As a first step, we interviewed four data scientists to understand how ML experts approach elicitation, specification, and assurance of requirements and expectations. The results show that changes in the development paradigm, i.e., from coding to training, also demand changes in RE. We conclude that development of ML systems demands requirements engineers to: (1) understand ML performance measures to state good functional requirements, (2) be aware of new quality requirements such as explainability, freedom from discrimination, or specific legal requirements, and (3) integrate ML specifics in the RE process. Our study provides a first contribution towards an RE methodology for ML systems.

SEMar 31, 2019
Video Game Development in a Rush: A Survey of the Global Game Jam Participants

Markus Borg, Vahid Garousi, Anas Mahmoud et al.

Video game development is a complex endeavor, often involving complex software, large organizations, and aggressive release deadlines. Several studies have reported that periods of "crunch time" are prevalent in the video game industry, but there are few studies on the effects of time pressure. We conducted a survey with participants of the Global Game Jam (GGJ), a 48-hour hackathon. Based on 198 responses, the results suggest that: (1) iterative brainstorming is the most popular method for conceptualizing initial requirements; (2) continuous integration, minimum viable product, scope management, version control, and stand-up meetings are frequently applied development practices; (3) regular communication, internal playtesting, and dynamic and proactive planning are the most common quality assurance activities; and (4) familiarity with agile development has a weak correlation with perception of success in GGJ. We conclude that GGJ teams rely on ad hoc approaches to development and face-to-face communication, and recommend some complementary practices with limited overhead. Furthermore, as our findings are similar to recommendations for software startups, we posit that game jams and the startup scene share contextual similarities. Finally, we discuss the drawbacks of systemic "crunch time" and argue that game jam organizers are in a good position to problematize the phenomenon.

SEMar 5, 2019
SZZ Unleashed: An Open Implementation of the SZZ Algorithm -- Featuring Example Usage in a Study of Just-in-Time Bug Prediction for the Jenkins Project

Markus Borg, Oscar Svensson, Kristian Berg et al.

Numerous empirical software engineering studies rely on detailed information about bugs. While issue trackers often contain information about when bugs were fixed, details about when they were introduced to the system are often absent. As a remedy, researchers often rely on the SZZ algorithm as a heuristic approach to identify bug-introducing software changes. Unfortunately, as reported in a recent systematic literature review, few researchers have made their SZZ implementations publicly available. Consequently, there is a risk that research effort is wasted as new projects based on SZZ output need to initially reimplement the approach. Furthermore, there is a risk that newly developed (closed source) SZZ implementations have not been properly tested, thus conducting research based on their output might introduce threats to validity. We present SZZ Unleashed, an open implementation of the SZZ algorithm for git repositories. This paper describes our implementation along with a usage example for the Jenkins project, and concludes with an illustrative study on just-in-time bug prediction. We hope to continue evolving SZZ Unleashed on GitHub, and warmly invite the community to contribute.

LGMar 4, 2019
Towards Structured Evaluation of Deep Neural Network Supervisors

Jens Henriksson, Christian Berger, Markus Borg et al.

Deep Neural Networks (DNN) have improved the quality of several non-safety related products in the past years. However, before DNNs should be deployed to safety-critical applications, their robustness needs to be systematically analyzed. A common challenge for DNNs occurs when input is dissimilar to the training set, which might lead to high confidence predictions despite a lack of proper knowledge of the input. Several previous studies have proposed to complement DNNs with a supervisor that detects when inputs are outside the scope of the network. Most of these supervisors, however, are developed and tested for a selected scenario using a specific performance metric. In this work, we emphasize the need to assess and compare the performance of supervisors in a structured way. We present a framework constituted by four datasets organized in six test cases combined with seven evaluation metrics. The test cases provide varying complexity and include data from publicly available sources as well as a novel dataset consisting of images from simulated driving scenarios. The latter we plan to make publicly available. Our framework can be used to support DNN supervisor evaluation, which in turn could be used to motivate development, validation, and deployment of DNNs in safety-critical applications.

SEDec 13, 2018
Safely Entering the Deep: A Review of Verification and Validation for Machine Learning and a Challenge Elicitation in the Automotive Industry

Markus Borg, Cristofer Englund, Krzysztof Wnuk et al.

Deep Neural Networks (DNN) will emerge as a cornerstone in automotive software engineering. However, developing systems with DNNs introduces novel challenges for safety assessments. This paper reviews the state-of-the-art in verification and validation of safety-critical systems that rely on machine learning. Furthermore, we report from a workshop series on DNNs for perception with automotive experts in Sweden, confirming that ISO 26262 largely contravenes the nature of DNNs. We recommend aerospace-to-automotive knowledge transfer and systems-based safety approaches, e.g., safety cage architectures and simulated system test cases.

SEDec 4, 2018
Practical relevance of software engineering research: Synthesizing the community's voice

Vahid Garousi, Markus Borg, Markku Oivo

Software engineering (SE) research should be relevant to industrial practice. There have been regular discussions in the SE community on this issue since the 1980s, led by pioneers such as Robert Glass. As we recently passed the milestone of "50 years of software engineering", some recent positive efforts have been made in this direction, e.g., establishing "industrial" tracks in several SE conferences. However, many researchers and practitioners believe that we, as a community, are still struggling with research relevance and utility. The goal of this paper is to synthesize the evidence and experience-based opinions shared on this topic so far in the SE community, and to encourage the community to further reflect and act on the research relevance. For this purpose, we have conducted a Multi-vocal Literature Review (MLR) of 54 systematically-selected sources (papers and non-peer-reviewed articles). Instead of relying on and considering the individual opinions on research relevance mentioned in each of the sources, the MLR aims to synthesize and provide the "holistic" view on the topic. The highlights of our MLR findings are as follows. The top three root causes of low relevance, discussed in the community, are: (1) Researchers having simplistic views (or wrong assumptions) about SE in practice; (2) Lack of connection with industry; and (3) Wrong identification of research problems. The top three suggestions for improving research relevance are: (1) Using appropriate research approaches such as action-research; (2) Choosing relevant research problems; and (3) Collaborating with industry. By synthesizing all the discussions on this important topic so far, this paper aims to encourage further discussions and actions in the community to increase our collective efforts to improve the research relevance.

SEAug 29, 2018
Enabling Visual Design Verification Analytics - From Prototype Visualizations to an Analytics Tool using the Unity Game Engine

Markus Borg, Daniel Brytting, Daniel Hansson

The ever-increasing architectural complexity in contemporary ASIC projects turns Design Verification (DV) into a highly advanced endeavor. Pressing needs for short time-to-market have made automation a key solution in DV. However, recurring execution of large regression suites inevitably leads to challenging amounts of test results. Following the design science paradigm, we present an action research study to introduce visual analytics in a commercial ASIC project. We develop a cityscape visualization tool using the game engine Unity. Initial evaluations are promising, suggesting that the tool offers a novel approach to identify error-prone parts of the design, as well as coverage holes.

SEFeb 1, 2018
Digitalization of Swedish Government Agencies - A Perspective Through the Lens of a Software Development Census

Markus Borg, Thomas Olsson, Ulrik Franke et al.

Software engineering is at the core of the digitalization of society. Ill-informed decisions can have major consequences, as made evident in the 2017 government crisis in Sweden, originating in a data breach caused by an outsourcing deal made by the Swedish Transport Agency. Many Government Agencies (GovAgs) in Sweden are rapidly undergoing a digital transition, thus it is important to overview how widespread, and mature, software development is in this part of the public sector. We present a software development census of Swedish GovAgs, complemented by document analysis and a survey. We show that 39.2% of the GovAgs develop software internally, some matching the number of developers in large companies. Our findings suggest that the development largely resembles private sector counterparts, and that established best practices are implemented. Still, we identify improvement potential in the areas of strategic sourcing, openness, collaboration across GovAgs, and quality requirements. The Swedish Government has announced the establishment of a new digitalization agency next year, and our hope is that the software engineering community will contribute its expertise with a clear voice.

SEMay 15, 2017
Piggybacking on an Autonomous Hauler: Business Models Enabling a System-of-Systems Approach to Mapping an Underground Mine

Markus Borg, Thomas Olsson, John Svensson

With ever-increasing productivity targets in mining operations, there is a growing interest in mining automation. In future mines, remote-controlled and autonomous haulers will operate underground guided by LiDAR sensors. We envision reusing LiDAR measurements to maintain accurate mine maps that would contribute to both safety and productivity. Extrapolating from a pilot project on reliable wireless communication in Boliden's Kankberg mine, we propose establishing a system-of-systems (SoS) with LiDAR-equipped haulers and existing mapping solutions as constituent systems. SoS requirements engineering inevitably adds a political layer, as independent actors are stakeholders both on the system and SoS levels. We present four SoS scenarios representing different business models, discussing how development and operations could be distributed among Boliden and external stakeholders, e.g., the vehicle suppliers, the hauling company, and the developers of the mapping software. Based on eight key variation points, we compare the four scenarios from both technical and business perspectives. Finally, we validate our findings in a seminar with participants from the relevant stakeholders. We conclude that to determine which scenario is the most promising for Boliden, trade-offs regarding control, costs, risks, and innovation must be carefully evaluated.

CLApr 26, 2017
On Using Active Learning and Self-Training when Mining Performance Discussions on Stack Overflow

Markus Borg, Iben Lennerstad, Rasmus Ros et al.

Abundant data is the key to successful machine learning. However, supervised learning requires annotated data that are often hard to obtain. In a classification task with limited resources, Active Learning (AL) promises to guide annotators to examples that bring the most value for a classifier. AL can be successfully combined with self-training, i.e., extending a training set with the unlabelled examples for which a classifier is the most certain. We report our experiences on using AL in a systematic manner to train an SVM classifier for Stack Overflow posts discussing performance of software components. We show that the training examples deemed as the most valuable to the classifier are also the most difficult for humans to annotate. Despite carefully evolved annotation criteria, we report low inter-rater agreement, but we also propose mitigation strategies. Finally, based on one annotator's work, we show that self-training can improve the classification accuracy. We conclude the paper by discussing implications for future text miners aspiring to use AL and self-training.

SEMar 6, 2017
Software Engineers' Information Seeking Behavior in Change Impact Analysis - An Interview Study

Markus Borg, Emil Alégroth, Per Runeson

Software engineers working in large projects must navigate complex information landscapes. Change Impact Analysis (CIA) is a task that relies on engineers' successful information seeking in databases storing, e.g., source code, requirements, design descriptions, and test case specifications. Several previous approaches to support information seeking are task-specific, thus understanding engineers' seeking behavior in specific tasks is fundamental. We present an industrial case study on how engineers seek information in CIA, with a particular focus on traceability and development artifacts that are not source code. We show that engineers have different information seeking behavior, and that some do not consider traceability particularly useful when conducting CIA. Furthermore, we observe a tendency for engineers to prefer less rigid types of support rather than formal approaches, i.e., engineers value support that allows flexibility in how to practically conduct CIA. Finally, due to diverse information seeking behavior, we argue that future CIA support should embrace individual preferences to identify change impact by empowering several seeking alternatives, including searching, browsing, and tracing.

SEFeb 13, 2017
From LiDAR to Underground Maps via 5G - Business Models Enabling a System-of-Systems Approach to Mapping the Kankberg Mine

Markus Borg, Thomas Olsson, John Svensson

With ever-increasing productivity targets in mining operations, there is a growing interest in mining automation. The PIMM project addresses the fundamental challenge of network communication by constructing a pilot 5G network in the underground mine Kankberg. In this report, we discuss how such a 5G network could constitute the essential infrastructure to organize existing systems in Kankberg into a system-of-systems (SoS). We analyze a scenario in which LiDAR-equipped vehicles operating in the mine are connected to existing mine mapping and positioning solutions. The approach is motivated by the approaching era of remote-controlled, or even autonomous, vehicles in mining operations. The proposed SoS could ensure continuously updated maps of Kankberg, rendered in unprecedented detail, supporting both productivity and safety in the underground mine. We present four different SoS solutions from an organizational point of view, discussing how development and operations of the constituent systems could be distributed among Boliden and external stakeholders, e.g., the vehicle suppliers, the hauling company, and the developers of the mapping software. The four scenarios are compared from both technical and business perspectives, based on trade-off discussions and SWOT analyses. We conclude our report by recommending continued research along two future paths, namely a closer cooperation with the vehicle suppliers, and further feasibility studies regarding establishing a Kankberg software ecosystem.

SENov 13, 2016
An Industrial Case Study on Measuring the Quality of the Requirements Scoping Process

Krzysztof Wnuk, Markus Borg, Sardar Muhammad Sulaman

Decision making and requirements scoping occupy central roles in helping to develop products that are demanded by the customers and ensuring company strategies are accurately realized in product scope. Many companies experience continuous and frequent scope changes and fluctuations but struggle to measure the phenomena and correlate the measurement to the quality of the requirements process. We present the results from an exploratory interview study among 22 participants working with requirements management processes at a large company that develops embedded systems for a global market. Our respondents shared their opinions about the current set of requirements management process metrics as well as what additional metrics they envisioned as useful. We present a set of metrics that describe the quality of the requirements scoping process. The findings provide practical insights that can be used as input when introducing new measurement programs for requirements management and decision making.

SEMay 23, 2016
Practitioners' Perspectives on Change Impact Analysis for Safety-Critical Software - A Preliminary Analysis

Markus Borg, José-Luis de la Vara, Krzysztof Wnuk

Safety standards prescribe change impact analysis (CIA) during evolution of safety-critical software systems. Although CIA is a fundamental activity, there is a lack of empirical studies about how it is performed in practice. We present a case study on CIA in the context of an evolving automation system, based on 14 interviews in Sweden and India. Our analysis suggests that engineers on average spend 50-100 hours on CIA per year, but the effort varies considerably with the phases of projects. Also, the respondents attached different connotations to CIA and perceived the importance of CIA differently. We report the most pressing CIA challenges, and several ideas on how to support future CIA. However, we show that measuring the effect of such improvement solutions is non-trivial, as CIA is intertwined with other development activities. While this paper only reports preliminary results, our work contributes empirical insights into practical CIA.

SEFeb 24, 2016
Advancing Trace Recovery Evaluation - Applied Information Retrieval in a Software Engineering Context

Markus Borg

Successful development of software systems involves efficient navigation among software artifacts. One state-of-practice approach to structure information is to establish trace links between artifacts, a practice that is also enforced by several development standards. Unfortunately, manually maintaining trace links in an evolving system is a tedious task. To tackle this issue, several researchers have proposed treating the capture and recovery of trace links as an Information Retrieval (IR) problem. The work contains a Systematic Literature Review (SLR) of previous evaluations of IR-based trace recovery. We show that a majority of previous evaluations have been technology-oriented, conducted in "the cave of IR evaluation", using small datasets as experimental input. Also, software artifacts originating from student projects have frequently been used in evaluations. We conducted a survey among traceability researchers, and found that a majority consider student artifacts to be only partly representative of industrial counterparts. Our findings call for additional case studies to evaluate IR-based trace recovery within the full complexity of an industrial setting. Also, this thesis contributes to the body of empirical evidence of IR-based trace recovery in two experiments with industrial software artifacts. The technology-oriented experiment highlights the clear dependence between datasets and the accuracy of IR-based trace recovery, in line with findings from the SLR. The human-oriented experiment investigates how different quality levels of tool output affect the tracing accuracy of engineers. Finally, we present how tools and methods are evaluated in the general field of IR research, and propose a taxonomy of evaluation contexts tailored for IR-based trace recovery.

SEFeb 17, 2016
Testing Quality Requirements of a System-of-Systems in the Public Sector - Challenges and Potential Remedies

Jacob Larsson, Markus Borg, Thomas Olsson

Quality requirements are a difficult concept in software projects, and testing software qualities is a well-known challenge. Without proper management of quality requirements, there is an increased risk that the software product under development will not meet the expectations of its future users. In this paper, we share experiences from testing quality requirements when developing a large system-of-systems in the public sector in Sweden. We complement the experience reporting by analyzing documents from the case under study. As a final step, we match the identified challenges with solution proposals from the literature. We report five main challenges covering inadequate requirements engineering and disconnected test managers. Finally, we match the challenges to solutions proposed in the scientific literature, including integrated requirements engineering, the twin peaks model, virtual plumblines, the QUPER model, and architecturally significant requirements. Our experiences are valuable to other large development projects struggling with testing of quality requirements. Furthermore, the report could be used as input to process improvement activities in the case under study.

SEOct 10, 2014
Workshop Summary of the 1st International Workshop on Requirements and Testing (RET'14)

Michael Felderer, Elizabeth Bjarnason, Markus Borg et al.

The main objective of the RET workshop was to explore the interaction of Requirements Engineering (RE) and Testing, i.e., RET, in research and industry, and the challenges that result from this interaction. While much work has been done in the respective fields of requirements engineering and testing, much more can be done to understand the connection between the processes of RE and of testing.