37.0SEJun 2
Neural Change Prediction: Relating Software Changes to Their Effects and Vice VersaLaura Plein, Souhila Zidane, Jordan Samhi et al.
Much of software development revolves around understanding the relationship between software changes and their effects. If we could learn and predict those relationships, such predictions could benefit several areas of software engineering. While recent advances in artificial intelligence have shown great promise in software engineering tasks, predicting the semantics of code without executing it remains a big challenge. In this paper, we present Neural Change Prediction, a novel and fundamental technique to learn and predict associations between software changes and their dynamic effects on program behavior. Specifically, for a given program and test inputs, we automatically apply numerous mutations to the code and observe how these changes alter the program's output. From these (changes to software, changes in behavior)-pairs, we create models that: (1) for a desired change in behavior, predict where and how the code should be changed (feature localization, software evolution, and software repair); and (2) for a given code change, predict how this code change affects the output (effect prediction). We have conducted a detailed case study on CSS configuration files and an evaluation on Python programs to demonstrate the generality and wide applicability of Neural Change Prediction. While Neural Change Prediction requires numerous mutations (and thus numerous executions of the program under test), Neural Change Prediction is fully automatic and does not require any prior knowledge of the code or its semantics, making it applicable to any software artifact that can be executed and whose output can be observed.
SESep 28, 2023
Revisiting Neural Program Smoothing for FuzzingMaria-Irina Nicolae, Max Eisele, Andreas Zeller
Testing with randomly generated inputs (fuzzing) has gained significant traction due to its capacity to expose program vulnerabilities automatically. Fuzz testing campaigns generate large amounts of data, making them ideal for the application of machine learning (ML). Neural program smoothing (NPS), a specific family of ML-guided fuzzers, aims to use a neural network as a smooth approximation of the program target for new test case generation. In this paper, we conduct the most extensive evaluation of NPS fuzzers against standard gray-box fuzzers (>11 CPU years and >5.5 GPU years), and make the following contributions: (1) We find that the original performance claims for NPS fuzzers do not hold; a gap we relate to fundamental, implementation, and experimental limitations of prior works. (2) We contribute the first in-depth analysis of the contribution of machine learning and gradient-based mutations in NPS. (3) We implement Neuzz++, which shows that addressing the practical limitations of NPS fuzzers improves performance, but that standard gray-box fuzzers almost always surpass NPS-based fuzzers. (4) As a consequence, we propose new guidelines targeted at benchmarking fuzzing based on machine learning, and present MLFuzz, a platform with GPU access for easy and reproducible evaluation of ML-based fuzzers. Neuzz++, MLFuzz, and all our data are public.
SEJul 11, 2024
Learning Program Behavioral Models from Synthesized Input-Output PairsTural Mammadov, Dietrich Klakow, Alexander Koller et al.
We introduce Modelizer - a novel framework that, given a black-box program, learns a model from its input/output behavior using neural machine translation algorithms. The resulting model mocks the original program: Given an input, the model predicts the output that would have been produced by the program. However, the model is also reversible - that is, the model can predict the input that would have produced a given output. Finally, the model is differentiable and can be efficiently restricted to predict only a certain aspect of the program behavior. Modelizer uses grammars to synthesize and inputs and unsupervised tokenizers to decompose the resulting outputs, allowing it to learn sequence-to-sequence associations between token streams. Other than input grammars, Modelizer only requires the ability to execute the program. The resulting models are small, requiring fewer than 6.3 million parameters for languages such as Markdown or HTML; and they are accurate, achieving up to 95.4% accuracy and a BLEU score of 0.98 with standard error 0.04 in mocking real-world applications. As it learns from and predicts executions rather than code, Modelizer departs from the LLM-centric research trend, opening new opportunities for program-specific models that are fully tuned towards individual programs. Indeed, we foresee several applications of these models, especially as the output of the program can be any aspect of program behavior. Beyond mocking and predicting program behavior, the models can also synthesize inputs that are likely to produce a particular behavior, such as failures or coverage, thus assisting in program understanding and maintenance.
SEMay 7, 2021Code
What do all these Buttons do? Statically Mining Android User Interfaces at ScaleKonstantin Kuznetsov, Chen Fu, Song Gao et al.
We introduce FRONTMATTER: a tool to automatically mine both user interface models and behavior of Android apps at a large scale with high precision. Given an app, FRONTMATTER statically extracts all declared screens, the user interface elements, their textual and graphical features, as well as Android APIs invoked by interacting with them. Executed on tens of thousands of real-world apps, FRONTMATTER opens the door for comprehensive mining of mobile user interfaces, jumpstarting empirical research at a large scale, addressing questions such as "How many travel apps require registration?", "Which apps do not follow accessibility guidelines?", "Does the user interface correspond to the description?", and many more. FRONTMATTER and the mined dataset are available under an open-source license.
SEJan 8, 2021Code
Locating Faults with Program Slicing: An Empirical AnalysisEzekiel Soremekun, Lukas Kirschner, Marcel Böhme et al.
Statistical fault localization is an easily deployed technique for quickly determining candidates for faulty code locations. If a human programmer has to search the fault beyond the top candidate locations, though, more traditional techniques of following dependencies along dynamic slices may be better suited. In a large study of 457 bugs (369 single faults and 88 multiple faults) in 46 open source C programs, we compare the effectiveness of statistical fault localization against dynamic slicing. For single faults, we find that dynamic slicing was eight percentage points more effective than the best performing statistical debugging formula; for 66% of the bugs, dynamic slicing finds the fault earlier than the best performing statistical debugging formula. In our evaluation, dynamic slicing is more effective for programs with single fault, but statistical debugging performs better on multiple faults. Best results, however, are obtained by a hybrid approach: If programmers first examine at most the top five most suspicious locations from statistical debugging, and then switch to dynamic slices, on average, they will need to examine 15% (30 lines) of the code. These findings hold for 18 most effective statistical debugging formulas and our results are independent of the number of faults (i.e. single or multiple faults) and error type (i.e. artificial or real errors).
LGFeb 5
Synthesizing Realistic Test Data without Breaking PrivacyLaura Plein, Alexi Turcotte, Arina Hallemans et al.
There is a need for synthetic training and test datasets that replicate statistical distributions of original datasets without compromising their confidentiality. A lot of research has been done in leveraging Generative Adversarial Networks (GANs) for synthetic data generation. However, the resulting models are either not accurate enough or are still vulnerable to membership inference attacks (MIA) or dataset reconstruction attacks since the original data has been leveraged in the training process. In this paper, we explore the feasibility of producing a synthetic test dataset with the same statistical properties as the original one, with only indirectly leveraging the original data in the generation process. The approach is inspired by GANs, with a generation step and a discrimination step. However, in our approach, we use a test generator (a fuzzer) to produce test data from an input specification, preserving constraints set by the original data; a discriminator model determines how close we are to the original data. By evolving samples and determining "good samples" with the discriminator, we can generate privacy-preserving data that follows the same statistical distributions are the original dataset, leading to a similar utility as the original data. We evaluated our approach on four datasets that have been used to evaluate the state-of-the-art techniques. Our experiments highlight the potential of our approach towards generating synthetic datasets that have high utility while preserving privacy.
SENov 22, 2025
Synthesizing Precise Protocol Specs from Natural Language for Effective Test GenerationKuangxiangzi Liu, Dhiman Chakraborty, Alexander Liggesmeyer et al.
Safety- and security-critical systems have to be thoroughly tested against their specifications. The state of practice is to have _natural language_ specifications, from which test cases are derived manually - a process that is slow, error-prone, and difficult to scale. _Formal_ specifications, on the other hand, are well-suited for automated test generation, but are tedious to write and maintain. In this work, we propose a two-stage pipeline that uses large language models (LLMs) to bridge the gap: First, we extract _protocol elements_ from natural-language specifications; second, leveraging a protocol implementation, we synthesize and refine a formal _protocol specification_ from these elements, which we can then use to massively test further implementations. We see this two-stage approach to be superior to end-to-end LLM-based test generation, as 1. it produces an _inspectable specification_ that preserves traceability to the original text; 2. the generation of actual test cases _no longer requires an LLM_; 3. the resulting formal specs are _human-readable_, and can be reviewed, version-controlled, and incrementally refined; and 4. over time, we can build a _corpus_ of natural-language-to-formal-specification mappings that can be used to further train and refine LLMs for more automatic translations. Our prototype, AUTOSPEC, successfully demonstrated the feasibility of our approach on five widely used _internet protocols_ (SMTP, POP3, IMAP, FTP, and ManageSieve) by applying its methods on their _RFC specifications_ written in natural-language, and the recent _I/O grammar_ formalism for protocol specification and fuzzing. In its evaluation, AUTOSPEC recovers on average 92.8% of client and 80.2% of server message types, and achieves 81.5% message acceptance across diverse, real-world systems.
SEFeb 27, 2025
Will AI replace Software Engineers? Do not hold your breathAbhik Roychoudhury, Andreas Zeller
Artificial Intelligence (AI) technology such as Large Language Models (LLMs) have become extremely popular in creating code. This has led to the conjecture that future software jobs will be exclusively conducted by LLMs, and the software industry will cease to exist. But software engineering is much more than producing code -- notably, \emph{maintaining} large software and keeping it reliable is a major part of software engineering, which LLMs are not yet capable of.
SESep 23, 2021
FormatFuzzer: Effective Fuzzing of Binary File FormatsRafael Dutra, Rahul Gopinath, Andreas Zeller
Effective fuzzing of programs that process structured binary inputs, such as multimedia files, is a challenging task, since those programs expect a very specific input format. Existing fuzzers, however, are mostly format-agnostic, which makes them versatile, but also ineffective when a specific format is required. We present FormatFuzzer, a generator for format-specific fuzzers. FormatFuzzer takes as input a binary template (a format specification used by the 010 Editor) and compiles it into C++ code that acts as parser, mutator, and highly efficient generator of inputs conforming to the rules of the language. The resulting format-specific fuzzer can be used as a standalone producer or mutator in black-box settings, where no guidance from the program is available. In addition, by providing mutable decision seeds, it can be easily integrated with arbitrary format-agnostic fuzzers such as AFL to make them format-aware. In our evaluation on complex formats such as MP4 or ZIP, FormatFuzzer showed to be a highly effective producer of valid inputs that also detected previously unknown memory errors in ffmpeg and timidity.
SEMar 4, 2021
Restoring Execution Environments of Jupyter NotebooksJiawei Wang, Li Li, Andreas Zeller
More than ninety percent of published Jupyter notebooks do not state dependencies on external packages. This makes them non-executable and thus hinders reproducibility of scientific results. We present SnifferDog, an approach that 1) collects the APIs of Python packages and versions, creating a database of APIs; 2) analyzes notebooks to determine candidates for required packages and versions; and 3) checks which packages are required to make the notebook executable (and ideally, reproduce its stored results). In its evaluation, we show that SnifferDog precisely restores execution environments for the largest majority of notebooks, making them immediately executable for end users.
SEDec 25, 2020
Fuzzing with Fast Failure FeedbackRahul Gopinath, Bachir Bendrissou, Björn Mathis et al.
Fuzzing -- testing programs with random inputs -- has become the prime technique to detect bugs and vulnerabilities in programs. To generate inputs that cover new functionality, fuzzers require execution feedback from the program -- for instance, the coverage obtained by previous inputs, or the conditions that need to be resolved to cover new branches. If such execution feedback is not available, though, fuzzing can only rely on chance, which is ineffective. In this paper, we introduce a novel fuzzing technique that relies on failure feedback only -- that is, information on whether an input is valid or not, and if not, where the error occurred. Our bFuzzer tool enumerates byte after byte of the input space and tests the program until it finds valid prefixes, and continues exploration from these prefixes. Since no instrumentation or execution feedback is required, bFuzzer is language agnostic and the required tests execute very quickly. We evaluate our technique on five subjects, and show that bFuzzer is effective and efficient even in comparison to its white-box counterpart.
SEDec 12, 2019
Inferring Input Grammars from Dynamic Control FlowRahul Gopinath, Björn Mathis, Andreas Zeller
A program is characterized by its input model, and a formal input model can be of use in diverse areas including vulnerability analysis, reverse engineering, fuzzing and software testing, clone detection and refactoring. Unfortunately, input models for typical programs are often unavailable or out of date. While there exist algorithms that can mine the syntactical structure of program inputs, they either produce unwieldy and incomprehensible grammars, or require heuristics that target specific parsing patterns. In this paper, we present a general algorithm that takes a program and a small set of sample inputs and automatically infers a readable context-free grammar capturing the input language of the program. We infer the syntactic input structure only by observing access of input characters at different locations of the input parser. This works on all program stack based recursive descent input parsers, including PEG and parser combinators, and can do entirely without program specific heuristics. Our Mimid prototype produced accurate and readable grammars for a variety of evaluation subjects, including expr, URLparse, and microJSON.
SENov 18, 2019
Building Fast FuzzersRahul Gopinath, Andreas Zeller
Fuzzing is one of the key techniques for evaluating the robustness of programs against attacks. Fuzzing has to be effective in producing inputs that cover functionality and find vulnerabilities. But it also has to be efficient in producing such inputs quickly. Random fuzzers are very efficient, as they can quickly generate random inputs; but they are not very effective, as the large majority of inputs generated is syntactically invalid. Grammar-based fuzzers make use of a grammar (or another model for the input language) to produce syntactically correct inputs, and thus can quickly cover input space and associated functionality. Existing grammar-based fuzzers are surprisingly inefficient, though: Even the fastest grammar fuzzer Dharma still produces inputs about a thousand times slower than the fastest random fuzzer. So far, one can have an effective or an efficient fuzzer, but not both. In this paper, we describe how to build fast grammar fuzzers from the ground up, treating the problem of fuzzing from a programming language implementation perspective. Starting with a Python textbook approach, we adopt and adapt optimization techniques from functional programming and virtual machine implementation techniques together with other novel domain-specific optimizations in a step-by-step fashion. In our F1 prototype fuzzer, these improve production speed by a factor of 100--300 over the fastest grammar fuzzer Dharma. As F1 is even 5--8 times faster than a lexical random fuzzer, we can find bugs faster and test with much larger valid inputs than previously possible.
SEJun 12, 2019
Better Code, Better Sharing:On the Need of Analyzing Jupyter NotebooksJiawei Wang, Li Li, Andreas Zeller
By bringing together code, text, and examples, Jupyter notebooks have become one of the most popular means to produce scientific results in a productive and reproducible way. As many of the notebook authors are experts in their scientific fields, but laymen with respect to software engineering, one may ask questions on the quality of notebooks and their code. In a preliminary study, we experimentally demonstrate that Jupyter notebooks are inundated with poor quality code, e.g., not respecting recommended coding practices, or containing unused variables and deprecated functions. Considering the education nature of Jupyter notebooks, these poor coding practices as well as the lacks of quality control might be propagated into the next generation of developers. Hence, we argue that there is a strong need to programmatically analyze Jupyter notebooks, calling on our community to pay more attention to the reliability of Jupyter notebooks.
SEJun 4, 2019
Bridging the Gap between Unit Test Generation and System Test GenerationAlexander Kampmann, Andreas Zeller
Common test generators fall into two categories. Generating test inputs at the unit level is fast, but can lead to false alarms when a function is called with inputs that would not occur in a system context. If a generated input at the system level causes a failure, this is a true alarm, as the input could also have come from the user or a third party; but system testing is much slower. In this paper, we introduce the concept of a test generation bridge, which joins the accuracy of system testing with the speed of unit testing. A Test Generation Bridge allows to combine an arbitrary system test generator with an arbitrary unit test generator. It does so by carving parameterized unit tests from system (test) executions. These unit tests run in a context recorded from the system test, but individual parameters are left free for the unit test generator to systematically explore. This allows symbolic test generators such as KLEE to operate on individual functions in the recorded system context. If the test generator detects a failure, we lift the failure-inducing parameter back to the system input; if the failure can be reproduced at the system level, it is reported as a true alarm. Our BASILISK prototype can extract and test units out of complex systems such as a Web/Python/SQLite/C stack; in its evaluation, it achieves a higher coverage than a state-of-the-art system test generator.
SEDec 19, 2018
Carving Parameterized Unit TestsAlexander Kampmann, Andreas Zeller
We present a method to automatically extract ("carve") parameterized unit tests from system executions. The unit tests execute the same functions as the system tests they are carved from, but can do so much faster as they call functions directly; furthermore, being parameterized, they can execute the functions with a large variety of randomly selected input values. If a unit-level test fails, we lift it to the system level to ensure the failure can be reproduced there. Our method thus allows to focus testing efforts on selected modules while still avoiding false alarms: In our experiments, running parameterized unit tests for individual functions was, on average, 30~times faster than running the system tests they were carved from.
SEDec 18, 2018
Inputs from Hell: Generating Uncommon Inputs from Common SamplesEsteban Pavese, Ezekiel Soremekun, Nikolas Havrikov et al.
Generating structured input files to test programs can be performed by techniques that produce them from a grammar that serves as the specification for syntactically correct input files. Two interesting scenarios then arise for effective testing. In the first scenario, software engineers would like to generate inputs that are as similar as possible to the inputs in common usage of the program, to test the reliability of the program. More interesting is the second scenario where inputs should be as dissimilar as possible from normal usage. This is useful for robustness testing and exploring yet uncovered behavior. To provide test cases for both scenarios, we leverage a context-free grammar to parse a set of sample input files that represent the program's common usage, and determine probabilities for individual grammar production as they occur during parsing the inputs. Replicating these probabilities during grammar-based test input generation, we obtain inputs that are close to the samples. Inverting these probabilities yields inputs that are strongly dissimilar to common inputs, yet still valid with respect to the grammar. Our evaluation on three common input formats (JSON, JavaScript, CSS) shows the effectiveness of these approaches in obtaining instances from both sets of inputs.
SEOct 18, 2018
Sample-Free Learning of Input Grammars for Comprehensive Software FuzzingRahul Gopinath, Björn Mathis, Mathias Höschele et al.
Generating valid test inputs for a program is much easier if one knows the input language. We present first successes for a technique that, given a program P without any input samples or models, learns an input grammar that represents the syntactically valid inputs for P -- a grammar which can then be used for highly effective test generation for P . To this end, we introduce a test generator targeted at input parsers that systematically explores parsing alternatives based on dynamic tracking of constraints; the resulting inputs go into a grammar learner producing a grammar that can then be used for fuzzing. In our evaluation on subjects such as JSON, URL, or Mathexpr, our PYGMALION prototype took only a few minutes to infer grammars and generate thousands of valid high-quality inputs.
CRNov 14, 2016
Quantifying the Information Leak in Cache Attacks through Symbolic ExecutionSudipta Chattopadhyay, Moritz Beck, Ahmed Rezine et al.
Cache timing attacks allow attackers to infer the properties of a secret execution by observing cache hits and misses. But how much information can actually leak through such attacks? For a given program, a cache model, and an input, our CHALICE framework leverages symbolic execution to compute the amount of information that can possibly leak through cache attacks. At the core of CHALICE is a novel approach to quantify information leak that can highlight critical cache side-channel leaks on arbitrary binary code. In our evaluation on real-world programs from OpenSSL and Linux GDK libraries, CHALICE effectively quantifies information leaks: For an AES-128 implementation on Linux, for instance, CHALICE finds that a cache attack can leak as much as 127 out of 128 bits of the encryption key.
SEJul 20, 2014
Inferring Loop Invariants by Mutation, Dynamic Analysis, and Static CheckingJuan P. Galeotti, Carlo A. Furia, Eva May et al.
Verifiers that can prove programs correct against their full functional specification require, for programs with loops, additional annotations in the form of loop invariants---propeties that hold for every iteration of a loop. We show that significant loop invariant candidates can be generated by systematically mutating postconditions; then, dynamic checking (based on automatically generated tests) weeds out invalid candidates, and static checking selects provably valid ones. We present a framework that automatically applies these techniques to support a program prover, paving the way for fully automatic verification without manually written loop invariants: Applied to 28 methods (including 39 different loops) from various java.util classes (occasionally modified to avoid using Java features not fully supported by the static checker), our DYNAMATE prototype automatically discharged 97% of all proof obligations, resulting in automatic complete correctness proofs of 25 out of the 28 methods---outperforming several state-of-the-art tools for fully automatic verification.
SEMar 5, 2014
Automated Fixing of Programs with ContractsYu Pei, Carlo A. Furia, Martin Nordio et al.
This paper describes AutoFix, an automatic debugging technique that can fix faults in general-purpose software. To provide high-quality fix suggestions and to enable automation of the whole debugging process, AutoFix relies on the presence of simple specification elements in the form of contracts (such as pre- and postconditions). Using contracts enhances the precision of dynamic analysis techniques for fault detection and localization, and for validating fixes. The only required user input to the AutoFix supporting tool is then a faulty program annotated with contracts; the tool produces a collection of validated fixes for the fault ranked according to an estimate of their suitability. In an extensive experimental evaluation, we applied AutoFix to over 200 faults in four code bases of different maturity and quality (of implementation and of contracts). AutoFix successfully fixed 42% of the faults, producing, in the majority of cases, corrections of quality comparable to those competent programmers would write; the used computational resources were modest, with an average time per fix below 20 minutes on commodity hardware. These figures compare favorably to the state of the art in automated program fixing, and demonstrate that the AutoFix approach is successfully applicable to reduce the debugging burden in real-world scenarios.