LGFeb 11
Kill it with FIRE: On Leveraging Latent Space Directions for Runtime Backdoor Mitigation in Deep Neural NetworksEnrico Ahlers, Daniel Passon, Yannic Noller et al.
Machine learning models are increasingly present in our everyday lives; as a result, they become targets of adversarial attackers seeking to manipulate the systems we interact with. A well-known vulnerability is a backdoor introduced into a neural network by poisoned training data or a malicious training process. Backdoors can be used to induce unwanted behavior by including a certain trigger in the input. Existing mitigations filter training data, modify the model, or perform expensive input modifications on samples. If a vulnerable model has already been deployed, however, those strategies are either ineffective or inefficient. To address this gap, we propose our inference-time backdoor mitigation approach called FIRE (Feature-space Inference-time REpair). We hypothesize that a trigger induces structured and repeatable changes in the model's internal representation. We view the trigger as directions in the latent spaces between layers that can be applied in reverse to correct the inference mechanism. Therefore, we turn the backdoored model against itself by manipulating its latent representations and moving a poisoned sample's features along the backdoor directions to neutralize the trigger. Our evaluation shows that FIRE has low computational overhead and outperforms current runtime mitigations on image benchmarks across various attacks, datasets, and network architectures.
41.5SEApr 21Code
CASCADE: Detecting Inconsistencies between Code and Documentation with Automatic Test GenerationTobias Kiecker, Jan Arne Sparka, Martin Reuter et al.
Maintaining consistency between code and documentation is a crucial yet frequently overlooked aspect of software development. Even minor mismatches can confuse API users, introduce new bugs, and increase overall maintenance effort. This creates demand for automated solutions that can assist developers in identifying code-documentation inconsistencies. However, since automatic reports still require human confirmation, false positives carry serious consequences: wasting developer time and discouraging practical adoption. We introduce CASCADE (Consistency Analysis for Source Code And Documentation through Execution), a novel tool for detecting inconsistencies with a strong emphasis on reducing false positives. CASCADE leverages Large Language Models (LLMs) to generate unit tests directly from natural-language documentation. Since these tests are derived from the documentation, any failure during execution indicates a potential mismatch between the documented and actual behavior of the code. To minimize false positives, CASCADE also generates code from the documentation to cross-check the generated tests. By design, an inconsistency is reported only when two conditions are met: the existing code fails a test, while the code generated from the documentation passes the same test. We evaluated CASCADE on a novel dataset of 71 inconsistent and 814 consistent code-documentation pairs drawn from open-source Java projects. Further, we applied CASCADE to additional Java, C#, and Rust repositories, where we uncovered 13 previously unknown inconsistencies, of which 10 have subsequently been fixed, demonstrating both CASCADE's precision and its applicability to real-world codebases.
SEJul 16, 2021Code
Towards a Benchmark Set for Program Repair Based on Partial FixesDirk Beyer, Lars Grunske, Thomas Lemberger et al.
Software bugs significantly contribute to software cost and increase the risk of system malfunctioning. In recent years, many automated program-repair approaches have been proposed to automatically fix undesired program behavior. Despite of their great success, specific problems such as fixing bugs with partial fixes still remain unresolved. A partial fix to a known software issue is a programmer's failed attempt to fix the issue the first time. Even though it fails, this fix attempt still conveys important information such as the suspicious software region and the bug type. In this work we do not propose an approach for program repair with partial fixes, but instead answer a preliminary question: Do partial fixes occur often enough, in general, to be relevant for the research area of automated program repair? We crawled 1500 open-source C repositories on GitHub for partial fixes. The result is a benchmark set of 2204 benchmark tasks for automated program repair based on partial fixes. The benchmark set is available open source and open to further contributions and improvement.
SEJun 19, 2019Code
Does Diversity Improve the Test Suite Generation for Mobile Applications?Thomas Vogel, Chinh Tran, Lars Grunske
In search-based software engineering we often use popular heuristics with default configurations, which typically lead to suboptimal results, or we perform experiments to identify configurations on a trial-and-error basis, which may lead to better results for a specific problem. To obtain better results while avoiding trial-and-error experiments, a fitness landscape analysis is helpful in understanding the search problem, and making an informed decision about the heuristics. In this paper, we investigate the search problem of test suite generation for mobile applications (apps) using SAPIENZ whose heuristic is a default NSGA-II. We analyze the fitness landscape of SAPIENZ with respect to genotypic diversity and use the gained insights to adapt the heuristic of SAPIENZ. These adaptations result in SAPIENZ^div that aims for preserving the diversity of test suites during the search. To evaluate SAPIENZ^div, we perform a head-to-head comparison with SAPIENZ on 76 open-source apps.
SEFeb 26, 2022
BeDivFuzz: Integrating Behavioral Diversity into Generator-based FuzzingHoang Lam Nguyen, Lars Grunske
A popular metric to evaluate the performance of fuzzers is branch coverage. However, we argue that focusing solely on covering many different branches (i.e., the richness) is not sufficient since the majority of the covered branches may have been exercised only once, which does not inspire a high confidence in the reliability of the covered code. Instead, the distribution of the executed branches (i.e., the evenness) should also be considered. That is, behavioral diversity is only given if the generated inputs not only trigger many different branches, but also trigger them evenly often with diverse inputs. We introduce BeDivFuzz, a feedback-driven fuzzing technique for generator-based fuzzers. BeDivFuzz distinguishes between structure-preserving and structure-changing mutations in the space of syntactically valid inputs, and biases its mutation strategy towards validity and behavioral diversity based on the received program feedback. We have evaluated BeDivFuzz on Ant, Maven, Rhino, Closure, Nashorn, and Tomcat. The results show that BeDivFuzz achieves better behavioral diversity than the state of the art, measured by established biodiversity metrics, namely the Hill numbers, from the field of ecology.
CRJan 20, 2022
VUDENC: Vulnerability Detection with Deep Learning on a Natural Codebase for PythonLaura Wartschinski, Yannic Noller, Thomas Vogel et al.
Context: Identifying potential vulnerable code is important to improve the security of our software systems. However, the manual detection of software vulnerabilities requires expert knowledge and is time-consuming, and must be supported by automated techniques. Objective: Such automated vulnerability detection techniques should achieve a high accuracy, point developers directly to the vulnerable code fragments, scale to real-world software, generalize across the boundaries of a specific software project, and require no or only moderate setup or configuration effort. Method: In this article, we present VUDENC (Vulnerability Detection with Deep Learning on a Natural Codebase), a deep learning-based vulnerability detection tool that automatically learns features of vulnerable code from a large and real-world Python codebase. VUDENC applies a word2vec model to identify semantically similar code tokens and to provide a vector representation. A network of long-short-term memory cells (LSTM) is then used to classify vulnerable code token sequences at a fine-grained level, highlight the specific areas in the source code that are likely to contain vulnerabilities, and provide confidence levels for its predictions. Results: To evaluate VUDENC, we used 1,009 vulnerability-fixing commits from different GitHub repositories that contain seven different types of vulnerabilities (SQL injection, XSS, Command injection, XSRF, Remote code execution, Path disclosure, Open redirect) for training. In the experimental evaluation, VUDENC achieves a recall of 78%-87%, a precision of 82%-96%, and an F1 score of 80%-90%. VUDENC's code, the datasets for the vulnerabilities, and the Python corpus for the word2vec model are available for reproduction. Conclusions: Our experimental results suggest...
SEJan 9, 2022
A systematic literature review on counterexample explanationArut Prakash Kaleeswaran, Arne Nordmann, Thomas Vogel et al.
Context: Safety is of paramount importance for cyber-physical systems in domains such as automotive, robotics, and avionics. Formal methods such as model checking are one way to ensure the safety of cyber-physical systems. However, adoption of formal methods in industry is hindered by usability issues, particularly the difficulty of understanding model checking results. Objective: We want to provide an overview of the state of the art for counterexample explanation by investigating the contexts, techniques, and evaluation of research approaches in this field. This overview shall provide an understanding of current and guide future research. Method: To provide this overview, we conducted a systematic literature review. The survey comprises 116 publications that address counterexample explanations for model checking. Results: Most primary studies provide counterexample explanations graphically or as traces, minimize counterexamples to reduce complexity, localize errors in the models expressed in the input formats of model checkers, support linear temporal logic or computation tree logic specifications, and use model checkers of the Symbolic Model Verifier family. Several studies evaluate their approaches in safety-critical domains with industrial applications. Conclusion: We notably see a lack of research on counterexample explanation that targets probabilistic and real-time systems, leverages the explanations to domain-specific models, and evaluates approaches in user studies. We conclude by discussing the adequacy of different types of explanations for users with varying domain and formal methods expertise, showing the need to support laypersons in understanding model checking results to increase adoption of formal methods in industry.
SEAug 13, 2021
A User-Study Protocol for Evaluation of Formal Verification Results and their ExplanationArut Prakash Kaleeswaran, Arne Nordmann, Thomas Vogel et al.
Context: The complexity of modern safety-critical systems in industries keep on increasing due to the rising number of features and functionalities. This calls for formal methods in order to entrust confidence in such systems. Nevertheless, using formal methods in industry is demanding because of usability issues, e.g., the difficulty of understanding model checking results. Thus the hypothesis is, presenting the result of model checker results in a user-friendly manner could promote formal methods usage in industries. Objective: We aim to evaluate the acceptance of formal methods by engineers if the complexity of understanding verification results is made easy. Method: We perform two different exploratory studies. First, we conduct an online survey to explore challenges in identifying inconsistent specifications and using formal methods from engineers. Second, we perform a one group pretest and posttest experiment to collect impressions from engineers using formal methods if understanding verification results is eased. Limitations: The main limitation of this study is the generalization because the survey focuses on a particular target group and it uses a pre-experimental design.
SEDec 28, 2020
A Comprehensive Empirical Evaluation of Generating Test Suites for Mobile Applications with DiversityThomas Vogel, Chinh Tran, Lars Grunske
Context: In search-based software engineering we often use popular heuristics with default configurations, which typically lead to suboptimal results, or we perform experiments to identify configurations on a trial-and-error basis, which may lead to better results for a specific problem. We consider the problem of generating test suites for mobile applications (apps) and rely on \Sapienz, a state-of-the-art approach to this problem that uses a popular heuristic (NSGA-II) with a default configuration. Objective: We want to achieve better results in generating test suites with \Sapienz while avoiding trial-and-error experiments to identify a more suitable configuration of \Sapienz. Method: We conducted a fitness landscape analysis of \Sapienz to analytically understand the search problem, which allowed us to make informed decisions about the heuristic and configuration of \Sapienz when developing \SapienzDiv. We comprehensively evaluated \SapienzDiv in a head-to-head comparison with \Sapienz on 34 apps. Results: Analyzing the fitness landscape of \Sapienz, we observed a lack of diversity of the evolved test suites and a stagnation of the search after 25 generations. \SapienzDiv realizes mechanisms that preserve the diversity of the test suites being evolved. The evaluation showed that \SapienzDiv achieves better or at least similar test results than \Sapienz concerning coverage and the number of revealed faults. However, \SapienzDiv typically produces longer test sequences and requires more execution time than \Sapienz. Conclusions: The understanding of the search problem obtained by the fitness landscape analysis helped us to find a more suitable configuration of \Sapienz without trial-and-error experiments. By promoting diversity of test suites during the search, improved or at least similar test results in terms of faults and coverage can be achieved.
SEAug 3, 2020
Bet and Run for Test Case GenerationSebastian Müller, Thomas Vogel, Lars Grunske
Anyone working in the technology sector is probably familiar with the question: "Have you tried turning it off and on again?", as this is usually the default question asked by tech support. Similarly, it is known in search based testing that metaheuristics might get trapped in a plateau during a search. As a human, one can look at the gradient of the fitness curve and decide to restart the search, so as to hopefully improve the results of the optimization with the next run. Trying to automate such a restart, it has to be programmatically decided whether the metaheuristic has encountered a plateau yet, which is an inherently difficult problem. To mitigate this problem in the context of theoretical search problems, the Bet and Run strategy was developed, where multiple algorithm instances are started concurrently, and after some time all but the single most promising instance in terms of fitness values are killed. In this paper, we adopt and evaluate the Bet and Run strategy for the problem of test case generation. Our work indicates that use of this restart strategy does not generally lead to gains in the quality metrics, when instantiated with the best parameters found in the literature.
SEAug 3, 2020
Evolutionary Grammar-Based FuzzingMartin Eberlein, Yannic Noller, Thomas Vogel et al.
A fuzzer provides randomly generated inputs to a targeted software to expose erroneous behavior. To efficiently detect defects, generated inputs should conform to the structure of the input format and thus, grammars can be used to generate syntactically correct inputs. In this context, fuzzing can be guided by probabilities attached to competing rules in the grammar, leading to the idea of probabilistic grammar-based fuzzing. However, the optimal assignment of probabilities to individual grammar rules to effectively expose erroneous behavior for individual systems under test is an open research question. In this paper, we present EvoGFuzz, an evolutionary grammar-based fuzzing approach to optimize the probabilities to generate test inputs that may be more likely to trigger exceptional behavior. The evaluation shows the effectiveness of EvoGFuzz in detecting defects compared to probabilistic grammar-based fuzzing (baseline). Applied to ten real-world applications with common input formats (JSON, JavaScript, or CSS3), the evaluation shows that EvoGFuzz achieved a significantly larger median line coverage for all subjects by up to 48% compared to the baseline. Moreover, EvoGFuzz managed to expose 11 unique defects, from which five have not been detected by the baseline.
SEJun 21, 2019
Challenges for Verifying and Validating Scientific Software in Computational Materials ScienceThomas Vogel, Stephan Druskat, Markus Scheidgen et al.
Many fields of science rely on software systems to answer different research questions. For valid results researchers need to trust the results scientific software produces, and consequently quality assurance is of utmost importance. In this paper we are investigating the impact of quality assurance in the domain of computational materials science (CMS). Based on our experience in this domain we formulate challenges for validation and verification of scientific software and their results. Furthermore, we describe directions for future research that can potentially help dealing with these challenges.
SEMar 12, 2019
Perpetual Assurances for Self-Adaptive SystemsDanny Weyns, Nelly Bencomo, Radu Calinescu et al.
Providing assurances for self-adaptive systems is challenging. A primary underlying problem is uncertainty that may stem from a variety of different sources, ranging from incomplete knowledge to sensor noise and uncertain behavior of humans in the loop. Providing assurances that the self-adaptive system complies with its requirements calls for an enduring process spanning the whole lifetime of the system. In this process, humans and the system jointly derive and integrate new evidence and arguments, which we coined perpetual assurances for self-adaptive systems. In this paper, we provide a background framework and the foundation for perpetual assurances for self-adaptive systems. We elaborate on the concrete challenges of offering perpetual assurances, requirements for solutions, realization techniques and mechanisms to make solutions suitable. We also present benchmark criteria to compare solutions. We then present a concrete exemplar that researchers can use to assess and compare approaches for perpetual assurances for self-adaptation.
SEDec 18, 2018
Inputs from Hell: Generating Uncommon Inputs from Common SamplesEsteban Pavese, Ezekiel Soremekun, Nikolas Havrikov et al.
Generating structured input files to test programs can be performed by techniques that produce them from a grammar that serves as the specification for syntactically correct input files. Two interesting scenarios then arise for effective testing. In the first scenario, software engineers would like to generate inputs that are as similar as possible to the inputs in common usage of the program, to test the reliability of the program. More interesting is the second scenario where inputs should be as dissimilar as possible from normal usage. This is useful for robustness testing and exploring yet uncovered behavior. To provide test cases for both scenarios, we leverage a context-free grammar to parse a set of sample input files that represent the program's common usage, and determine probabilities for individual grammar production as they occur during parsing the inputs. Replicating these probabilities during grammar-based test input generation, we obtain inputs that are close to the samples. Inverting these probabilities yields inputs that are strongly dissimilar to common inputs, yet still valid with respect to the grammar. Our evaluation on three common input formats (JSON, JavaScript, CSS) shows the effectiveness of these approaches in obtaining instances from both sets of inputs.
SEMay 5, 2015
Using Models at Runtime to Address Assurance for Self-Adaptive SystemsBetty Cheng, Kerstin Eder, Martin Gogolla et al.
A self-adaptive software system modifies its behavior at runtime in response to changes within the system or in its execution environment. The fulfillment of the system requirements needs to be guaranteed even in the presence of adverse conditions and adaptations. Thus, a key challenge for self-adaptive software systems is assurance. Traditionally, confidence in the correctness of a system is gained through a variety of activities and processes performed at development time, such as design analysis and testing. In the presence of selfadaptation, however, some of the assurance tasks may need to be performed at runtime. This need calls for the development of techniques that enable continuous assurance throughout the software life cycle. Fundamental to the development of runtime assurance techniques is research into the use of models at runtime (M@RT). This chapter explores the state of the art for usingM@RT to address the assurance of self-adaptive software systems. It defines what information can be captured by M@RT, specifically for the purpose of assurance, and puts this definition into the context of existing work. We then outline key research challenges for assurance at runtime and characterize assurance methods. The chapter concludes with an exploration of selected application areas where M@RT could provide significant benefits beyond existing assurance techniques for adaptive systems.