Alexander Egyed

SE
16 papers
143 citations
Novelty 33%
AI Score 39

16 Papers

32.4 · SE · May 12 · Code
An Extensive Replication Study of the ABLoTS Approach for Bug Localization

Feifei Niu, Enshuo Zhang, Christoph Mayr-Dorn et al.

Bug localization is the task of recommending source code locations (typically files) that contain the cause of a bug and hence need to be changed to fix the bug. Along these lines, information retrieval-based bug localization (IRBL) approaches have been adopted, which identify the most bug-prone files from the source code space. In current practice, a series of state-of-the-art IRBL techniques leverage the combination of different components (e.g., similar reports, version history, and code structure) to achieve better performance. ABLoTS is a recently proposed approach with the core component, TraceScore, that utilizes requirements and traceability information between different issue reports (i.e., feature requests and bug reports) to identify buggy source code snippets with promising results. To evaluate the accuracy of these results and obtain additional insights into the practical applicability of ABLoTS, we conducted a replication study of this approach with the original dataset and also on two extended datasets (i.e., additional Java dataset and Python dataset). The original dataset consists of 11 open source Java projects with 8,494 bug reports. The extended Java dataset includes 16 more projects comprising 25,893 bug reports and corresponding source code commits. The extended Python dataset consists of 12 projects with 1,289 bug reports. While we find that the TraceScore component, which is the core of ABLoTS, produces comparable or even better results with the extended datasets, we also find that we cannot reproduce the ABLoTS results, as reported in its original paper, due to an overlooked side effect of incorrectly choosing a cut-off date that led to test data leaking into training data with significant effects on performance.
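The leakage problem described above can be illustrated with a minimal sketch of a chronological split. This is not the ABLoTS or replication code: the `created` field, the `split_by_cutoff` helper, and the sample reports are hypothetical, and the only point is that every training report should predate every test report.

```python
from datetime import date

def split_by_cutoff(reports, cutoff):
    """Split reports chronologically around a cut-off date (hypothetical helper).

    Reports created before the cut-off go to training; the rest go to testing.
    """
    train = [r for r in reports if r["created"] < cutoff]
    test = [r for r in reports if r["created"] >= cutoff]
    return train, test

# Made-up sample data; the field names are assumptions for illustration.
reports = [
    {"id": "BUG-1", "created": date(2019, 1, 10)},
    {"id": "BUG-2", "created": date(2019, 6, 2)},
    {"id": "BUG-3", "created": date(2020, 3, 15)},
]

train, test = split_by_cutoff(reports, date(2020, 1, 1))

# Sanity check: the newest training report must be older than the oldest test report.
latest_train = max(r["created"] for r in train)
earliest_test = min(r["created"] for r in test)
assert latest_train < earliest_test
```

A check like the final assertion is cheap to run before training and catches a cut-off that lets later reports slip into the training set.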

SE · Apr 8, 2021 · Code
Do Communities in Developer Interaction Networks align with Subsystem Developer Teams? An Empirical Study of Open Source Systems

Usman Ashraf, Christoph Mayr-Dorn, Atif Mashkoor et al.

Studies over the past decade demonstrated that developers contributing to open source software systems tend to self-organize in "emerging" communities. This latent community structure has a significant impact on software quality. While several approaches address the analysis of developer interaction networks, the question of whether these emerging communities align with the developer teams working on various subsystems remains unanswered. Work on socio-technical congruence implies that people who work on the same task or artifact need to coordinate and thus communicate, potentially forming stronger interaction ties. Our empirical study of 10 open source projects revealed that developer communities change considerably across a project's lifetime (hence implying that relevant relations between developers change) and that their alignment with subsystem developer teams is mostly low. However, subsystem teams tend to remain more stable. These insights are useful for practitioners and researchers to better understand the developer interaction structure of open source systems.

SE · Mar 27, 2021
Team-oriented Consistency Checking of Heterogeneous Engineering Artifacts

Michael Alexander Tröls, Atif Mashkoor, Alexander Egyed

Consistency checking of interdependent heterogeneous engineering artifacts, such as requirements, specifications, and code, is a challenging task in large-scale engineering projects. The lack of team-oriented solutions allowing a multitude of project stakeholders to collaborate in a consistent manner is thus becoming a critical problem. In this context, this work proposes an approach for team-oriented consistency checking of collaboratively developed heterogeneous engineering artifacts.

SE · Mar 3, 2021
TaskAllocator: A Recommendation Approach for Role-based Tasks Allocation in Agile Software Development

Saad Shafiq, Atif Mashkoor, Christoph Mayr-Dorn et al.

In this paper, we propose a recommendation approach -- TaskAllocator -- to predict the assignment of incoming tasks to suitable roles. The proposed approach, which identifies team roles rather than individual persons, allows project managers to allocate tasks better when individual developers are over-utilized or have moved on to different roles/projects. We evaluated our approach on ten agile case study projects obtained from the Taiga.io repository. To determine TaskAllocator's performance, we conducted a benchmark study comparing it with contemporary machine learning models. The applicability of TaskAllocator was assessed through a plugin that can be integrated with JIRA and provides recommendations about suitable roles whenever a new task is added to the project. Lastly, the source code of the plugin and the dataset employed have been made public.

SE · Feb 11, 2021
Validation Obligations: A Novel Approach to Check Compliance between Requirements and their Formal Specification

Atif Mashkoor, Michael Leuschel, Alexander Egyed

Traditionally, practitioners use formal methods predominantly for one half of the quality-assurance process: verification (do we build the software right?). The other half -- validation (do we build the right software?) -- has been given comparatively little attention. While verification is the core of refinement-based formal methods, where each new refinement step must preserve all properties of its abstract model, validation is usually postponed until the latest stages of the development, when models can be automatically executed. Thus, mistakes in requirements or in their interpretation are caught too late: usually at the end of the development process. In this paper, we present a novel approach to check compliance between requirements and their formal refinement-based specification during the earlier stages of development. Our proposed approach -- "validation obligations" -- is based on the simple idea that both verification and validation are an integral part of all refinement steps of a system.

SE · Aug 11, 2020
Semantic Clone Detection via Probabilistic Software Modeling

Hannes Thaller, Lukas Linsbauer, Alexander Egyed

Semantic clone detection is the process of finding program elements with similar or equal runtime behavior. An example is detecting the semantic equality between the recursive and iterative implementations of the factorial computation. Semantic clone detection is the de facto technical boundary of clone detectors. In recent years, this boundary has been tested using interesting new approaches. This article contributes a semantic clone detection approach that detects clones that have 0% syntactic similarity. We present Semantic Clone Detection via Probabilistic Software Modeling (SCD-PSM) as a stable and precise solution to semantic clone detection. PSM builds a probabilistic model of a program that is capable of evaluating and generating runtime data. SCD-PSM leverages this model and its model elements for finding behaviorally equal model elements. This behavioral equality is then generalized to semantic equality of the original program elements. It uses the likelihood between model elements as a distance metric. Then, it employs the likelihood ratio significance test to decide whether this distance is significant, given a pre-specified and controllable false-positive rate. The output of SCD-PSM consists of pairs of program elements (i.e., methods), their distance, and a decision on whether they are clones or not. SCD-PSM yields excellent results with a Matthews Correlation Coefficient greater than 0.9. These results are obtained on classical semantic clone detection problems such as detecting recursive and iterative versions of an algorithm, but also on complex problems used in coding competitions.
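The factorial example mentioned in the abstract can be made concrete with a small sketch. This is not SCD-PSM: it uses a plain input/output comparison on sampled inputs instead of probabilistic models and a likelihood ratio test, and the helper `behave_alike` is a hypothetical name. It only illustrates what it means for two methods to share behavior while sharing almost no syntax.

```python
def factorial_recursive(n):
    """Factorial via recursion."""
    return 1 if n <= 1 else n * factorial_recursive(n - 1)

def factorial_iterative(n):
    """Factorial via a loop; syntactically unlike the recursive version."""
    result = 1
    for k in range(2, n + 1):
        result *= k
    return result

def behave_alike(f, g, inputs):
    """Hypothetical helper: True if f and g agree on every sampled input."""
    return all(f(x) == g(x) for x in inputs)

# Compare observed behavior on a sample of inputs.
same = behave_alike(factorial_recursive, factorial_iterative, range(0, 20))
```

Agreement on sampled inputs is evidence of behavioral similarity, not proof of equivalence, which is one reason a statistical decision procedure with a controlled false-positive rate is attractive.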

SE · May 27, 2020
Machine Learning for Software Engineering: A Systematic Mapping

Saad Shafiq, Atif Mashkoor, Christoph Mayr-Dorn et al.

Context: The software development industry is rapidly adopting machine learning for transitioning modern-day software systems towards highly intelligent and self-learning systems. However, the full potential of machine learning for improving the software engineering life cycle itself is yet to be discovered, i.e., to what extent machine learning can help reduce the effort/complexity of software engineering and improve the quality of resulting software systems. To date, no comprehensive study exists that explores the current state of the art on the adoption of machine learning across software engineering life cycle stages. Objective: This article addresses the aforementioned problem and aims to present the state of the art on the growing number of uses of machine learning in software engineering. Method: We conduct a systematic mapping study on applications of machine learning to software engineering following the standard guidelines and principles of empirical software engineering. Results: This study introduces a machine learning for software engineering (MLSE) taxonomy classifying the state-of-the-art machine learning techniques according to their applicability to various software engineering life cycle stages. Overall, 227 articles were rigorously selected and analyzed as a result of this study. Conclusion: From the selected articles, we explore a variety of aspects that should be helpful to academics and practitioners alike in understanding the potential of adopting machine learning techniques during software engineering projects.

SE · Apr 17, 2020
Model-driven Engineering of Safety and Security Systems: A Systematic Mapping Study

Atif Mashkoor, Alexander Egyed, Robert Wille

This paper presents a systematic mapping study on the model-driven engineering of safety and security concerns in systems. Integrated modeling and development of both safety and security concerns is an emerging field of research. Our mapping study provides an overview of the current state of the art in this field. Through a rigorous and systematic process, this study carefully selected 95 publications out of 17,927 relevant papers published between 1992 and 2018. This paper then proposes and answers several relevant research questions about frequently used methods, the development stages in which these concerns are typically investigated, and application domains. Additionally, we identify the community's preference for publication venues and trends.

SE · Jan 21, 2020
Towards Fault Localization via Probabilistic Software Modeling

Hannes Thaller, Lukas Linsbauer, Alexander Egyed et al.

Software testing helps developers to identify bugs. However, awareness of bugs is only the first step. Finding and correcting the faulty program components is equally hard and essential for high-quality software. Fault localization automatically pinpoints the location of an existing bug in a program. It is a hard problem, and existing methods are not yet precise enough for widespread industrial adoption. We propose fault localization via Probabilistic Software Modeling (PSM). PSM analyzes the structure and behavior of a program and synthesizes a network of Probabilistic Models (PMs). Each PM models a method with its inputs and outputs and is capable of evaluating the likelihood of runtime data. We use this likelihood evaluation to find fault locations and their impact on dependent code elements. Results indicate that PSM is a robust framework for accurate fault localization.

SE · Jan 21, 2020
Towards Semantic Clone Detection via Probabilistic Software Modeling

Hannes Thaller, Lukas Linsbauer, Alexander Egyed

Semantic clones are program components with similar behavior, but different textual representation. Semantic similarity is hard to detect, and semantic clone detection is still an open issue. We present semantic clone detection via Probabilistic Software Modeling (PSM) as a robust method for detecting semantically equivalent methods. PSM inspects the structure and runtime behavior of a program and synthesizes a network of Probabilistic Models (PMs). Each PM in the network represents a method in the program and is capable of generating and evaluating runtime events. We leverage these capabilities to accurately find semantic clones. Results show that the approach can detect semantic clones in the complete absence of syntactic similarity with high precision and low error rates.

SE · Dec 17, 2019
Probabilistic Software Modeling: A Data-driven Paradigm for Software Analysis

Hannes Thaller, Lukas Linsbauer, Rudolf Ramler et al.

Software systems are complex, and behavioral comprehension with the increasing number of AI components challenges traditional testing and maintenance strategies. The lack of tools and methodologies for behavioral software comprehension leaves developers with testing and debugging that work within the boundaries of known scenarios. We present Probabilistic Software Modeling (PSM), a data-driven modeling paradigm for predictive and generative methods in software engineering. PSM analyzes a program and synthesizes a network of probabilistic models that can simulate and quantify the original program's behavior. The approach extracts the type, executable, and property structure of a program and copies its topology. Each model is then optimized towards the observed runtime, leading to a network that reflects the system's structure and behavior. The resulting network allows for the full spectrum of statistical inferential analysis with which rich predictive and generative applications can be built. Applications range from the visualization of states, inferential queries, test case generation, and anomaly detection up to the stochastic execution of the modeled system. In this work, we present the modeling methodologies, an empirical study of the runtime behavior of software systems, and a comprehensive study on PSM modeled systems. Results indicate that PSM is a solid foundation for structural and behavioral software comprehension applications.

SE · Dec 24, 2018
Feature Maps: A Comprehensible Software Representation for Design Pattern Detection

Hannes Thaller, Lukas Linsbauer, Alexander Egyed

Design patterns are elegant and well-tested solutions to recurrent software development problems. They are the result of software developers dealing with problems that frequently occur, solving them in the same or a slightly adapted way. A pattern's semantics provide the intent, motivation, and applicability, describing what it does, why it is needed, and where it is useful. Consequently, design patterns encode a well of information. Developers weave this information into their systems whenever they use design patterns to solve problems. This work presents Feature Maps, a flexible human- and machine-comprehensible software representation based on micro-structures. Our algorithm, the Feature-Role Normalization, presses the high-dimensional, inhomogeneous vector space of micro-structures into a feature map. We apply these concepts to the problem of detecting instances of design patterns in source code. We evaluate our methodology on four design patterns, a wide range of balanced and imbalanced labeled training data, and compare classical machine learning (Random Forests) with modern deep learning approaches (Convolutional Neural Networks). Feature maps yield robust classifiers even under challenging settings of strongly imbalanced data distributions without sacrificing human comprehensibility. Results suggest that feature maps are an excellent addition in the software analysis toolbox that can reveal useful information hidden in the source code.

SE · Jun 13, 2017
Exploring Code Clones in Programmable Logic Controller Software

Hannes Thaller, Rudolf Ramler, Josef Pichler et al.

The reuse of code fragments by copying and pasting is widely practiced in software development and results in code clones. Cloning is considered an anti-pattern as it negatively affects program correctness and increases maintenance efforts. Programmable Logic Controller (PLC) software is no exception in the code clone discussion as reuse in development and maintenance is frequently achieved through copy, paste, and modification. Even though the presence of code clones may not necessarily be a problem per se, it is important to detect, track, and manage clones as the software system evolves. Unfortunately, tool support for clone detection and management is not commonly available for PLC software systems or is limited to generic tools with a reduced set of features. In this paper, we investigate code clones in a real-world PLC software system based on IEC 61131-3 Structured Text and C/C++. We extended a widely used tool for clone detection with normalization support. Furthermore, we evaluated the different types and natures of code clones in the studied system and their relevance for refactoring. Results shed light on the applicability and usefulness of clone detection in the context of industrial automation systems and demonstrate the benefit of adapting detection and management tools for IEC 61131-3 languages.
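One common form of normalization in clone detection replaces identifiers and literals with placeholders so that copied-and-renamed fragments compare as equal. The sketch below assumes that meaning; the abstract does not say which normalization the authors implemented or which tool they extended. The `KEYWORDS` set, the `normalize` function, and the two Structured Text-like lines are made up for illustration.

```python
import re

# Small, illustrative subset of keywords that should survive normalization.
KEYWORDS = {"IF", "THEN", "ELSE", "END_IF", "AND", "OR", "NOT"}

# Token pattern: identifier, number, or any single non-space symbol.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|\S")

def normalize(line):
    """Replace identifiers with ID and numeric literals with NUM."""
    out = []
    for tok in TOKEN.findall(line):
        if tok.upper() in KEYWORDS:
            out.append(tok.upper())
        elif tok[0].isalpha() or tok[0] == "_":
            out.append("ID")
        elif tok[0].isdigit():
            out.append("NUM")
        else:
            out.append(tok)
    return " ".join(out)

a = normalize("IF temp1 > 80 THEN valve := 1;")
b = normalize("IF pressure > 120 THEN pump := 0;")
is_clone_candidate = (a == b)
```

After normalization the two lines are identical token sequences, so a detector that compares normalized text would flag them as a candidate clone pair even though names and constants differ.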

SE · Jun 11, 2014
A Hitchhiker's Guide to Search-Based Software Engineering for Software Product Lines

Roberto E. Lopez-Herrejon, Javier Ferrer, Francisco Chicano et al.

Search-Based Software Engineering (SBSE) is an emerging discipline that focuses on the application of search-based optimization techniques to software engineering problems. The capacity of SBSE techniques to tackle problems involving large search spaces makes their application attractive for Software Product Lines (SPLs). In recent years, several publications have appeared that apply SBSE techniques to SPL problems. In this paper, we present the results of a systematic mapping study of such publications. We identified the stages of the SPL life cycle where SBSE techniques have been used, what case studies have been employed, and how they have been analysed. This mapping study revealed potential avenues for further research as well as common misunderstandings and pitfalls when applying SBSE techniques, which we address by providing a guideline for researchers and practitioners interested in exploiting these techniques.

SE · Jan 21, 2014
Towards a Benchmark and a Comparison Framework for Combinatorial Interaction Testing of Software Product Lines

Roberto E. Lopez-Herrejon, Javier Ferrer, Francisco Chicano et al.

As Software Product Lines (SPLs) are becoming a more pervasive development practice, their effective testing is becoming a more important concern. In the past few years, many SPL testing approaches have been proposed, among them those that support Combinatorial Interaction Testing (CIT), whose premise is to select a group of products where faults due to feature interactions are more likely to occur. Many CIT techniques for SPL testing have been put forward; however, no systematic and comprehensive comparison among them has been performed. To achieve such a goal, two items are important: a common benchmark of feature models and an adequate comparison framework. In this research-in-progress paper, we propose 19 feature models as the base of a benchmark, which we apply to three different techniques in order to analyze the comparison framework proposed by Perrouin et al. We identify the shortcomings of this framework and elaborate alternatives for further study.

SE · Nov 28, 2013
Improving CASA Runtime Performance by Exploiting Basic Feature Model Analysis

Evelyn Nicole Haslinger, Roberto E. Lopez-Herrejon, Alexander Egyed

In Software Product Line Engineering (SPLE), families of systems are designed, rather than developing the individual systems independently. Combinatorial Interaction Testing has proven to be effective for testing in the context of SPLE, where a representative subset of products is chosen for testing in place of the complete family. Such a subset of products can be determined by computing a so-called t-wise Covering Array (tCA), whose computation is NP-complete. Recently, reduction rules that exploit basic feature model analysis have been proposed that reduce the number of elements that need to be considered during the computation of tCAs for Software Product Lines (SPLs). We applied these rules to CASA, a simulated annealing algorithm for tCA generation for SPLs. We evaluated the adapted version of CASA using 133 publicly available feature models and recorded an average speedup of 61.8% in median execution time, while at the same time preserving the coverage of the generated array.
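To show what a t-wise covering array guarantees, here is a minimal sketch for t = 2 that checks whether a set of products covers every pair of feature values. It is not CASA and does not generate arrays. It also assumes unconstrained boolean features, whereas a real feature model would rule out some combinations; the function `uncovered_pairs` and the three-feature example are hypothetical.

```python
from itertools import combinations, product

def uncovered_pairs(products, n_features):
    """List the (i, j, value_i, value_j) combinations that no product covers.

    Assumes boolean features with no constraints between them.
    """
    missing = []
    for i, j in combinations(range(n_features), 2):
        seen = {(p[i], p[j]) for p in products}
        for vi, vj in product((False, True), repeat=2):
            if (vi, vj) not in seen:
                missing.append((i, j, vi, vj))
    return missing

# Four products over three features; the full family would have 2**3 = 8.
array = [
    (False, False, False),
    (False, True, True),
    (True, False, True),
    (True, True, False),
]

gaps_full = uncovered_pairs(array, 3)         # all pairs covered
gaps_partial = uncovered_pairs(array[:3], 3)  # dropping a product leaves gaps
```

In this example four products cover every pairwise combination that the eight-product family contains, which is the kind of reduction that makes testing a representative subset practical.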