Felix Dobslaw

SE
h-index12
7papers
24citations
Novelty35%
AI Score31

7 Papers

SEMay 2, 2025Code
Automatic techniques for issue report classification: A systematic mapping study

Muhammad Laiq, Felix Dobslaw

Several studies have evaluated automatic techniques for classifying software issue reports to assist practitioners in effectively assigning relevant resources based on the type of issue. Currently, no comprehensive overview of this area has been published. A comprehensive overview will help identify future research directions and provide an extensive collection of potentially relevant existing solutions. This study aims to provide a comprehensive overview of the use of automatic techniques to classify issue reports. We conducted a systematic mapping study and identified 46 studies on the topic. The study results indicate that the existing literature applies various techniques for classifying issue reports, including traditional machine learning and deep learning-based techniques and more advanced large language models. Furthermore, we observe that these studies (a) lack the involvement of practitioners, (b) do not consider other potentially relevant adoption factors beyond prediction accuracy, such as the explainability, scalability, and generalizability of the techniques, and (c) mainly rely on archival data from open-source repositories only. Therefore, future research should focus on real industrial evaluations, consider other potentially relevant adoption factors, and actively involve practitioners.

SEMar 1, 2025Code
Challenges in Testing Large Language Model Based Software: A Faceted Taxonomy

Felix Dobslaw, Robert Feldt, Juyeon Yoon et al.

Large Language Models (LLMs) and Multi-Agent LLMs (MALLMs) introduce non-determinism unlike traditional or machine learning software, requiring new approaches to verifying correctness beyond simple output comparisons or statistical accuracy over test datasets. This paper presents a taxonomy for LLM test case design, informed by research literature and our experience. Each facet is exemplified, and we conduct an LLM-assisted analysis of six open-source testing frameworks, perform a sensitivity study of an agent-based system across different model configurations, and provide working examples contrasting atomic and aggregated test cases. We identify key variation points that impact test correctness and highlight open challenges that the research, industry, and open-source communities must address as LLMs become integral to software systems. Our taxonomy defines four facets of LLM test case design, addressing ambiguity in both inputs and outputs while establishing best practices. It distinguishes variability in goals, the system under test, and inputs, and introduces two key oracle types: atomic and aggregated. Our findings reveal that current tools treat test executions as isolated events, lack explicit aggregation mechanisms, and inadequately capture variability across model versions, configurations, and repeated runs. This highlights the need for viewing correctness as a distribution of outcomes rather than a binary property, requiring closer collaboration between academia and practitioners to establish mature, variability-aware testing methodologies.

SEOct 28, 2021Code
On the Importance and Shortcomings of Code Readability Metrics: A Case Study on Reactive Programming

Gustaf Holst, Felix Dobslaw

Well structured and readable source code is a pre-requisite for maintainable software and successful collaboration among developers. Static analysis enables the automated extraction of code complexity and readability metrics which can be leveraged to highlight potential improvements in code to both attain software of high quality and reinforce good practices for developers as an educational tool. This assumes reliable readability metrics which are not trivial to obtain since code readability is somewhat subjective. Recent research has resulted in increasingly sophisticated models for predicting readability as perceived by humans primarily with a procedural and object oriented focus, while functional and declarative languages and language extensions advance as they often are said to lead to more concise and readable code. In this paper, we investigate whether the existing complexity and readability metrics reflect that wisdom or whether the notion of readability and its constituents requires overhaul in the light of programming language changes. We therefore compare traditional object oriented and reactive programming in terms of code complexity and readability in a case study. Reactive programming is claimed to increase code quality but few studies have substantiated these claims empirically. We refactored an object oriented open source project into a reactive candidate and compare readability with the original using cyclomatic complexity and two state-of-the-art readability metrics. More elaborate investigations are required, but our findings suggest that both cyclomatic complexity and readability decrease significantly at the same time in the reactive candidate, which seems counter-intuitive. We exemplify and substantiate why readability metrics may require adjustment to better suit popular programming styles other than imperative and object-oriented to better match human expectations.

SEOct 19, 2020Code
Using mutation testing to measure behavioural test diversity

Francisco Gomes de Oliveira Neto, Felix Dobslaw, Robert Feldt

Diversity has been proposed as a key criterion to improve testing effectiveness and efficiency.It can be used to optimise large test repositories but also to visualise test maintenance issues and raise practitioners' awareness about waste in test artefacts and processes. Even though these diversity-based testing techniques aim to exercise diverse behavior in the system under test (SUT), the diversity has mainly been measured on and between artefacts (e.g., inputs, outputs or test scripts). Here, we introduce a family of measures to capture behavioural diversity (b-div) of test cases by comparing their executions and failure outcomes. Using failure information to capture the SUT behaviour has been shown to improve effectiveness of history-based test prioritisation approaches. However, history-based techniques require reliable test execution logs which are often not available or can be difficult to obtain due to flaky tests, scarcity of test executions, etc. To be generally applicable we instead propose to use mutation testing to measure behavioral diversity by running the set of test cases on various mutated versions of the SUT. Concretely, we propose two specific b-div measures (based on accuracy and Matthew's correlation coefficient, respectively) and compare them with artefact-based diversity (a-div) for prioritising the test suites of 6 different open-source projects. Our results show that our b-div measures outperform a-div and random selection in all of the studied projects. The improvement is substantial with an average increase in average percentage of faults detected (APFD) of between 19% to 31% depending on the size of the subset of prioritised tests.

SEJan 18, 2020
Boundary Value Exploration for Software Analysis

Felix Dobslaw, Francisco Gomes de Oliveira Neto, Robert Feldt

For software to be reliable and resilient, it is widely accepted that tests must be created and maintained alongside the software itself. One safeguard from vulnerabilities and failures in code is to ensure correct behavior on the boundaries between the input space sub-domains. So-called boundary value analysis (BVA) and boundary value testing (BVT) techniques aim to exercise those boundaries and increase test effectiveness. However, the concepts of BVA and BVT themselves are not generally well defined, and it is not clear how to identify relevant sub-domains, and thus the boundaries delineating them, given a specification. This has limited adoption and hindered automation. We clarify BVA and BVT and introduce Boundary Value Exploration (BVE) to describe techniques that support them by helping to detect and identify boundary inputs. Additionally, we propose two concrete BVE techniques based on information-theoretic distance functions: (i) an algorithm for boundary detection and (ii) the usage of software visualization to explore the behavior of the software under test and identify its boundary behavior. As an initial evaluation, we apply these techniques on a much used and well-tested date handling library. Our results reveal questionable behavior at boundaries highlighted by our techniques. In conclusion, we argue that the boundary value exploration that our techniques enable is a step towards automated boundary value analysis and testing, fostering their wider use and improving test effectiveness and efficiency.

SEJul 8, 2019
Estimating Return on Investment for GUI Test Automation Tools

Felix Dobslaw, Robert Feldt, David Michaelsson et al.

Automated graphical user interface (GUI) tests can reduce manual testing activities and increase test frequency. This motivates the conversion of manual test cases into automated GUI tests. However, it is not clear whether such automation is cost-effective given that GUI automation scripts add to the code base and demand maintenance as a system evolves. In this paper, we introduce a method for estimating maintenance cost and Return on Investment (ROI) for Automated GUI Testing (AGT). The method utilizes the existing source code change history and can be used for evaluation also of other testing or quality assurance automation technologies. We evaluate the method for a real-world, industrial software system and compare two fundamentally different AGT tools, namely Selenium and EyeAutomate, to estimate and compare their ROI. We also report on their defect-finding capabilities and usability. The quantitative data is complemented by interviews with employees at the case company. The method was successfully applied and estimated maintenance cost and ROI for both tools are reported. Overall, the study supports earlier results showing that implementation time is the leading cost for introducing AGT. The findings further suggest that while EyeAutomate tests are significantly faster to implement, Selenium tests require more of a programming background but less maintenance.

SEMay 27, 2019
Towards Automated Boundary Value Testing with Program Derivatives and Search

Robert Feldt, Felix Dobslaw

A natural and often used strategy when testing software is to use input values at boundaries, i.e. where behavior is expected to change the most, an approach often called boundary value testing or analysis (BVA). Even though this has been a key testing idea for long it has been hard to clearly define and formalize. Consequently, it has also been hard to automate. In this research note we propose one such formalization of BVA by, in a similar way as to how the derivative of a function is defined in mathematics, considering (software) program derivatives. Critical to our definition is the notion of distance between inputs and outputs which we can formalize and then quantify based on ideas from Information theory. However, for our (black-box) approach to be practical one must search for test inputs with specific properties. Coupling it with search-based software engineering is thus required and we discuss how program derivatives can be used as and within fitness functions. This brief note does not allow a deeper, empirical investigation but we use a simple illustrative example throughout to introduce the main ideas. By combining program derivatives with search, we thus propose a practical as well as theoretically interesting technique for automated boundary value (analysis and) testing.