SEAug 28, 2023Code
MELT: Mining Effective Lightweight Transformations from Pull RequestsDaniel Ramos, Hailie Mitchell, Inês Lynce et al. · cmu
Software developers often struggle to update APIs, leading to manual, time-consuming, and error-prone processes. We introduce MELT, a new approach that generates lightweight API migration rules directly from pull requests in popular library repositories. Our key insight is that pull requests merged into open-source libraries are a rich source of information sufficient to mine API migration rules. By leveraging code examples mined from the library source and automatically generated code examples based on the pull requests, we infer transformation rules in \comby, a language for structural code search and replace. Since inferred rules from single code examples may be too specific, we propose a generalization procedure to make the rules more applicable to client projects. MELT rules are syntax-driven, interpretable, and easily adaptable. Moreover, unlike previous work, our approach enables rule inference to seamlessly integrate into the library workflow, removing the need to wait for client code migrations. We evaluated MELT on pull requests from four popular libraries, successfully mining 461 migration rules from code examples in pull requests and 114 rules from auto-generated code examples. Our generalization procedure increases the number of matches for mined rules by 9x. We applied these rules to client projects and ran their tests, which led to an overall decrease in the number of warnings and fixing some test cases demonstrating MELT's effectiveness in real-world scenarios.
SEOct 3, 2023
Large Language Models for Test-Free Fault LocalizationAidan Z. H. Yang, Ruben Martins, Claire Le Goues et al.
Fault Localization (FL) aims to automatically localize buggy lines of code, a key first step in many manual and automatic debugging tasks. Previous FL techniques assume the provision of input tests, and often require extensive program analysis, program instrumentation, or data preprocessing. Prior work on deep learning for APR struggles to learn from small datasets and produces limited results on real-world programs. Inspired by the ability of large language models (LLMs) of code to adapt to new tasks based on very few examples, we investigate the applicability of LLMs to line level fault localization. Specifically, we propose to overcome the left-to-right nature of LLMs by fine-tuning a small set of bidirectional adapter layers on top of the representations learned by LLMs to produce LLMAO, the first language model based fault localization approach that locates buggy lines of code without any test coverage information. We fine-tune LLMs with 350 million, 6 billion, and 16 billion parameters on small, manually curated corpora of buggy programs such as the Defects4J corpus. We observe that our technique achieves substantially more confidence in fault localization when built on the larger models, with bug localization performance scaling consistently with the LLM size. Our empirical evaluation shows that LLMAO improves the Top-1 results over the state-of-the-art machine learning fault localization (MLFL) baselines by 2.3%-54.4%, and Top-5 results by 14.4%-35.6%. LLMAO is also the first FL technique trained using a language model architecture that can detect security vulnerabilities down to the code line level.
SEOct 2, 2023
CAT-LM: Training Language Models on Aligned Code And TestsNikitha Rao, Kush Jain, Uri Alon et al.
Testing is an integral part of the software development process. Yet, writing tests is time-consuming and therefore often neglected. Classical test generation tools such as EvoSuite generate behavioral test suites by optimizing for coverage, but tend to produce tests that are hard to understand. Language models trained on code can generate code that is highly similar to that written by humans, but current models are trained to generate each file separately, as is standard practice in natural language processing, and thus fail to consider the code-under-test context when producing a test file. In this work, we propose the Aligned Code And Tests Language Model (CAT-LM), a GPT-style language model with 2.7 Billion parameters, trained on a corpus of Python and Java projects. We utilize a novel pretraining signal that explicitly considers the mapping between code and test files when available. We also drastically increase the maximum sequence length of inputs to 8,192 tokens, 4x more than typical code generation models, to ensure that the code context is available to the model when generating test code. We analyze its usefulness for realistic applications, showing that sampling with filtering (e.g., by compilability, coverage) allows it to efficiently produce tests that achieve coverage similar to ones written by developers while resembling their writing style. By utilizing the code context, CAT-LM generates more valid tests than even much larger language models trained with more data (CodeGen 16B and StarCoder) and substantially outperforms a recent test-specific model (TeCo) at test completion. Overall, our work highlights the importance of incorporating software-specific insights when training language models for code and paves the way to more powerful automated test generation.
SEJan 29, 2022Code
Using Dynamic Binary Instrumentation to Detect Failures in Robotics SoftwareDeborah S. Katz, Christopher S. Timperley, Claire Le Goues
Autonomous and Robotics Systems (ARSs) are widespread, complex, and increasingly coming into contact with the public. Many of these systems are safety-critical, and it is vital to detect software errors to protect against harm. We propose a family of novel techniques to detect unusual program executions and incorrect program behavior. We model execution behavior by collecting low-level signals at run time and using those signals to build machine learning models. These models can identify previously-unseen executions that are more likely to exhibit errors. We describe a tractable approach for collecting dynamic binary runtime signals on ARSs, allowing the systems to absorb most of the overhead from dynamic instrumentation. The architecture of ARSs is particularly well-adapted to hiding the overhead from instrumentation. We demonstrate the efficiency of these approaches on ARDUPILOT -- a popular open-source autopilot software system -- and HUSKY -- an unmanned ground vehicle -- in simulation. We instrument executions to gather data from which we build supervised machine learning models of executions and evaluate the accuracy of these models. We also analyze the amount of training data needed to develop models with various degrees of accuracy, measure the overhead added to executions that use the analysis tool, and analyze which runtime signals are most useful for detecting unusual behavior on the program under test. In addition, we analyze the effects of timing delays on the functional behavior of ARSs.
LGSep 28, 2021Code
Learning to Superoptimize Real-world ProgramsAlex Shypula, Pengcheng Yin, Jeremy Lacomis et al.
Program optimization is the process of modifying software to execute more efficiently. Superoptimizers attempt to find the optimal program by employing significantly more expensive search and constraint solving techniques. Generally, these methods do not scale well to programs in real development scenarios, and as a result, superoptimization has largely been confined to small-scale, domain-specific, and/or synthetic program benchmarks. In this paper, we propose a framework to learn to superoptimize real-world programs by using neural sequence-to-sequence models. We created a dataset consisting of over 25K real-world x86-64 assembly functions mined from open-source projects and propose an approach, Self Imitation Learning for Optimization (SILO) that is easy to implement and outperforms a standard policy gradient learning approach on our dataset. Our method, SILO, superoptimizes 5.9% of our test set when compared with the gcc version 10.3 compiler's aggressive optimization level -O3. We also report that SILO's rate of superoptimization on our test set is over five times that of a standard policy gradient approach and a model pre-trained on compiler optimization demonstration.
SEMar 21, 2021Code
An Empirical Study of OSS-Fuzz BugsZhen Yu Ding, Claire Le Goues
Continuous fuzzing is an increasingly popular technique for automated quality and security assurance. Google maintains OSS-Fuzz: a continuous fuzzing service for open source software. We conduct the first empirical study of OSS-Fuzz, analyzing 23,907 bugs found in 316 projects. We examine the characteristics of fuzzer-found faults, the lifecycles of such faults, and the evolution of fuzzing campaigns over time. We find that OSS-Fuzz is often effective at quickly finding bugs, and developers are often quick to patch them. However, flaky bugs, timeouts, and out of memory errors are problematic, people rarely file CVEs for security vulnerabilities, and fuzzing campaigns often exhibit punctuated equilibria, where developers might be surprised by large spikes in bugs found. Our findings have implications on future fuzzing research and practice.
SEFeb 12, 2021Code
SOAR: A Synthesis Approach for Data Science API RefactoringAnsong Ni, Daniel Ramos, Aidan Yang et al.
With the growth of the open-source data science community, both the number of data science libraries and the number of versions for the same library are increasing rapidly. To match the evolving APIs from those libraries, open-source organizations often have to exert manual effort to refactor the APIs used in the code base. Moreover, due to the abundance of similar open-source libraries, data scientists working on a certain application may have an abundance of libraries to choose, maintain and migrate between. The manual refactoring between APIs is a tedious and error-prone task. Although recent research efforts were made on performing automatic API refactoring between different languages, previous work relies on statistical learning with collected pairwise training data for the API matching and migration. Using large statistical data for refactoring is not ideal because such training data will not be available for a new library or a new version of the same library. We introduce Synthesis for Open-Source API Refactoring (SOAR), a novel technique that requires no training data to achieve API migration and refactoring. SOAR relies only on the documentation that is readily available at the release of the library to learn API representations and mapping between libraries. Using program synthesis, SOAR automatically computes the correct configuration of arguments to the APIs and any glue code required to invoke those APIs. SOAR also uses the interpreter's error messages when running refactored code to generate logical constraints that can be used to prune the search space. Our empirical evaluation shows that SOAR can successfully refactor 80% of our benchmarks corresponding to deep learning models with up to 44 layers with an average run time of 97.23 seconds, and 90% of the data wrangling benchmarks with an average run time of 17.31 seconds.
SENov 20, 2024
Are Large Language Models Memorizing Bug Benchmarks?Daniel Ramos, Claudia Mamede, Kush Jain et al. · cmu
Large Language Models (LLMs) have become integral to various software engineering tasks, including code generation, bug detection, and repair. To evaluate model performance in these domains, numerous bug benchmarks containing real-world bugs from software projects have been developed. However, a growing concern within the software engineering community is that these benchmarks may not reliably reflect true LLM performance due to the risk of data leakage. Despite this concern, limited research has been conducted to quantify the impact of potential leakage. In this paper, we systematically evaluate popular LLMs to assess their susceptibility to data leakage from widely used bug benchmarks. To identify potential leakage, we use multiple metrics, including a study of benchmark membership within commonly used training datasets, as well as analyses of negative log-likelihood and n-gram accuracy. Our findings show that certain models, in particular codegen-multi, exhibit significant evidence of memorization in widely used benchmarks like Defects4J, while newer models trained on larger datasets like LLaMa 3.1 exhibit limited signs of leakage. These results highlight the need for careful benchmark selection and the adoption of robust metrics to adequately assess models capabilities.
LGJan 8, 2025
Fast, Fine-Grained Equivalence Checking for Neural DecompilersLuke Dramko, Claire Le Goues, Edward J. Schwartz
Neural decompilers are machine learning models that reconstruct the source code from an executable program. Critical to the lifecycle of any machine learning model is an evaluation of its effectiveness. However, existing techniques for evaluating neural decompilation models have substantial weaknesses, especially when it comes to showing the correctness of the neural decompiler's predictions. To address this, we introduce codealign, a novel instruction-level code equivalence technique designed for neural decompilers. We provide a formal definition of a relation between equivalent instructions, which we term an equivalence alignment. We show how codealign generates equivalence alignments, then evaluate codealign by comparing it with symbolic execution. Finally, we show how the information codealign provides-which parts of the functions are equivalent and how well the variable names match-is substantially more detailed than existing state-of-the-art evaluation metrics, which report unitless numbers measuring similarity.
SEFeb 1
SPELL: Synthesis of Programmatic Edits using LLMsDaniel Ramos, Catarina Gamboa, Inês Lynce et al.
Library migration is a common but error-prone task in software development. Developers may need to replace one library with another due to reasons like changing requirements or licensing changes. Migration typically entails updating and rewriting source code manually. While automated migration tools exist, most rely on mining examples from real-world projects that have already undergone similar migrations. However, these data are scarce, and collecting them for arbitrary pairs of libraries is difficult. Moreover, these migration tools often miss out on leveraging modern code transformation infrastructure. In this paper, we present a new approach to automated API migration that sidesteps the limitations described above. Instead of relying on existing migration data or using LLMs directly for transformation, we use LLMs to extract migration examples. Next, we use an Agent to generalize those examples to reusable transformation scripts in PolyglotPiranha, a modern code transformation tool. Our method distills latent migration knowledge from LLMs into structured, testable, and repeatable migration logic, without requiring preexisting corpora or manual engineering effort. Experimental results across Python libraries show that our system can generate diverse migration examples and synthesize transformation scripts that generalize to real-world codebases.
CRJun 9, 2024
Security Vulnerability Detection with Multitask Self-Instructed Fine-Tuning of Large Language ModelsAidan Z. H. Yang, Haoye Tian, He Ye et al.
Software security vulnerabilities allow attackers to perform malicious activities to disrupt software operations. Recent Transformer-based language models have significantly advanced vulnerability detection, surpassing the capabilities of static analysis based deep learning models. However, language models trained solely on code tokens do not capture either the explanation of vulnerability type or the data flow structure information of code, both of which are crucial for vulnerability detection. We propose a novel technique that integrates a multitask sequence-to-sequence LLM with pro-gram control flow graphs encoded as a graph neural network to achieve sequence-to-classification vulnerability detection. We introduce MSIVD, multitask self-instructed fine-tuning for vulnerability detection, inspired by chain-of-thought prompting and LLM self-instruction. Our experiments demonstrate that MSIVD achieves superior performance, outperforming the highest LLM-based vulnerability detector baseline (LineVul), with a F1 score of 0.92 on the BigVul dataset, and 0.48 on the PreciseBugs dataset. By training LLMs and GNNs simultaneously using a combination of code and explanatory metrics of a vulnerable program, MSIVD represents a promising direction for advancing LLM-based vulnerability detection that generalizes to unseen data. Based on our findings, we further discuss the necessity for new labelled security vulnerability datasets, as recent LLMs have seen or memorized prior datasets' held-out evaluation data.
SEDec 5, 2021
VarCLR: Variable Semantic Representation Pre-training via Contrastive LearningQibin Chen, Jeremy Lacomis, Edward J. Schwartz et al.
Variable names are critical for conveying intended program behavior. Machine learning-based program analysis methods use variable name representations for a wide range of tasks, such as suggesting new variable names and bug detection. Ideally, such methods could capture semantic relationships between names beyond syntactic similarity, e.g., the fact that the names average and mean are similar. Unfortunately, previous work has found that even the best of previous representation approaches primarily capture relatedness (whether two variables are linked at all), rather than similarity (whether they actually have the same meaning). We propose VarCLR, a new approach for learning semantic representations of variable names that effectively captures variable similarity in this stricter sense. We observe that this problem is an excellent fit for contrastive learning, which aims to minimize the distance between explicitly similar inputs, while maximizing the distance between dissimilar inputs. This requires labeled training data, and thus we construct a novel, weakly-supervised variable renaming dataset mined from GitHub edits. We show that VarCLR enables the effective application of sophisticated, general-purpose language models like BERT, to variable name representation and thus also to related downstream tasks like variable name similarity search or spelling correction. VarCLR produces models that significantly outperform the state-of-the-art on IdBench, an existing benchmark that explicitly captures variable similarity (as distinct from relatedness). Finally, we contribute a release of all data, code, and pre-trained models, aiming to provide a drop-in replacement for variable representations used in either existing or future program analyses that rely on variable names.
SEAug 13, 2021
Augmenting Decompiler Output with Learned Variable Names and TypesQibin Chen, Jeremy Lacomis, Edward J. Schwartz et al.
A common tool used by security professionals for reverse-engineering binaries found in the wild is the decompiler. A decompiler attempts to reverse compilation, transforming a binary to a higher-level language such as C. High-level languages ease reasoning about programs by providing useful abstractions such as loops, typed variables, and comments, but these abstractions are lost during compilation. Decompilers are able to deterministically reconstruct structural properties of code, but comments, variable names, and custom variable types are technically impossible to recover. In this paper we present DIRTY (DecompIled variable ReTYper), a novel technique for improving the quality of decompiler output that automatically generates meaningful variable names and types. Empirical evaluation on a novel dataset of C code mined from GitHub shows that DIRTY outperforms prior work approaches by a sizable margin, recovering the original names written by developers 66.4% of the time and the original types 75.8% of the time.
ROApr 17, 2021
GzScenic: Automatic Scene Generation for Gazebo SimulatorAfsoon Afzal, Claire Le Goues, Christopher S. Timperley
Testing robotic and cyberphysical systems in simulation require specifications of the simulated environments (i.e., scenes). The Scenic domain-specific language provides a high-level probabilistic programming language that allows users to specify scenarios for simulation. Scenic automatically generates concrete scenes that can be rendered by simulators. However, Scenic is mainly designed for autonomous vehicle simulation and does not support the most popular general-purpose simulator: Gazebo. In this work, we present GzScenic; a tool that automatically generates scenes for simulation in Gazebo. GzScenic automatically generates both the models required for running Scenic on the scenarios, and the models that Gazebo requires for running the simulation.
SEAug 3, 2020
Understanding and Improving Artifact Sharing in Software Engineering ResearchChristopher S. Timperley, Lauren Herckis, Claire Le Goues et al.
In recent years, many software engineering researchers have begun to include artifacts alongside their research papers. Ideally, artifacts, including tools, benchmarks, and data, support the dissemination of ideas, provide evidence for research claims, and serve as a starting point for future research. However, in practice, artifacts suffer from a variety of issues that prevent the realization of their full potential. To help the software engineering community realize the full potential of artifacts, we seek to understand the challenges involved in the creation, sharing, and use of artifacts. To that end, we perform a mixed-methods study including a survey of artifacts in software engineering publications, and an online survey of 153 software engineering researchers. By analyzing the perspectives of artifact creators, users, and reviewers, we identify several high-level challenges that affect the quality of artifacts including mismatched expectations between these groups, and a lack of sufficient reward for both creators and reviewers. Using Diffusion of Innovations as an analytical framework, we examine how these challenges relate to one another, and build an understanding of the factors that affect the sharing and success of artifacts. Finally, we make recommendations to improve the quality of artifacts based on our results and existing best practices.
ROApr 15, 2020
A Study on the Challenges of Using Robotics Simulators for TestingAfsoon Afzal, Deborah S. Katz, Claire Le Goues et al.
Robotics simulation plays an important role in the design, development, and verification and validation of robotic systems. Recent studies have shown that simulation may be used as a cheaper, safer, and more reliable alternative to manual, and widely used, process of field testing. This is particularly important in the context of continuous integration pipelines, where integrated automated testing is key to reducing costs while maintaining system safety. However, simulation and automated testing are not seeing the degree of widespread adoption in practice that their potential would motivate. Our goal in this paper is to develop a principled understanding of the ways developers use simulation in their process, and the challenges they face in doing so. This type of understanding can guide the development of more effective simulators and testing techniques for modern robotics development. To that end, we conduct a survey of 82 robotics developers from a diversity of backgrounds that addresses the current capabilities and limits of simulation technology in practice. We find that simulation is used by 85% of our participants for testing, and that many participants desire to use simulation as part of their test automation. We identify 10 high-level challenges that impede developers from using simulation for manual and automated testing, and general purposes. These challenges include the gap between simulation and reality, a lack of reproducibility, and considerable resource costs associated with using simulators. Finally, we outline avenues for improvement in the development of new simulators that can help simulation reach its potential as a means of verification and validation.
SEMar 26, 2020
Empirical Study of Restarted and Flaky Builds on Travis CIThomas Durieux, Claire Le Goues, Michael Hilton et al.
Continuous Integration (CI) is a development practice where developers frequently integrate code into a common codebase. After the code is integrated, the CI server runs a test suite and other tools to produce a set of reports (e.g., output of linters and tests). If the result of a CI test run is unexpected, developers have the option to manually restart the build, re-running the same test suite on the same code; this can reveal build flakiness, if the restarted build outcome differs from the original build. In this study, we analyze restarted builds, flaky builds, and their impact on the development workflow. We observe that developers restart at least 1.72% of builds, amounting to 56,522 restarted builds in our Travis CI dataset. We observe that more mature and more complex projects are more likely to include restarted builds. The restarted builds are mostly builds that are initially failing due to a test, network problem, or a Travis CI limitations such as execution timeout. Finally, we observe that restarted builds have a major impact on development workflow. Indeed, in 54.42% of the restarted builds, the developers analyze and restart a build within an hour of the initial failure. This suggests that developers wait for CI results, interrupting their workflow to address the issue. Restarted builds also slow down the merging of pull requests by a factor of three, bringing median merging time from 16h to 48h.
SESep 19, 2019
DIRE: A Neural Approach to Decompiled Identifier NamingJeremy Lacomis, Pengcheng Yin, Edward J. Schwartz et al.
The decompiler is one of the most common tools for examining binaries without corresponding source code. It transforms binaries into high-level code, reversing the compilation process. Decompilers can reconstruct much of the information that is lost during the compilation process (e.g., structure and type information). Unfortunately, they do not reconstruct semantically meaningful variable names, which are known to increase code understandability. We propose the Decompiled Identifier Renaming Engine (DIRE), a novel probabilistic technique for variable name recovery that uses both lexical and structural information recovered by the decompiler. We also present a technique for generating corpora suitable for training and evaluating models of decompiled code renaming, which we use to create a corpus of 164,632 unique x86-64 binaries generated from C projects mined from GitHub. Our results show that on this corpus DIRE can predict variable names identical to the names in the original source code up to 74.3% of the time.
SEJan 16, 2018
Debugging Framework Applications: Benefits and ChallengesZack Coker, David Gray Widder, Claire Le Goues et al.
Aspects of frameworks, such as inversion of control and the structure of framework applications, require developers to adjust their debugging strategies as compared to debugging sequential programs. However, the benefits and challenges of framework debugging are not fully understood, and gaining this knowledge could provide guidance in debugging strategies and framework tool design. To gain insight into the process developers use to fix problems in framework applications, we performed two human studies investigating how developers fix applications that use a framework API incorrectly. These studies focused on the Android Fragment class and the ROS framework. We analyzed the results of the studies using a mixed-methods approach, consisting of techniques from grounded theory, qualitative content analysis, and case studies. From our analysis, we produced a theory of the benefits and challenges of framework debugging. This theory states that developers find inversion of control challenging when debugging but find the structure of framework applications helpful. This theory could be used to guide strategies for debugging framework applications and framework tool designs.
DLSep 5, 2017
Effectiveness of Anonymization in Double-Blind ReviewClaire Le Goues, Yuriy Brun, Sven Apel et al.
Double-blind review relies on the authors' ability and willingness to effectively anonymize their submissions. We explore anonymization effectiveness at ASE 2016, OOPSLA 2016, and PLDI 2016 by asking reviewers if they can guess author identities. We find that 74%-90% of reviews contain no correct guess and that reviewers who self-identify as experts on a paper's topic are more likely to attempt to guess, but no more likely to guess correctly. We present our findings, summarize the PC chairs' comments about administering double-blind review, discuss the advantages and disadvantages of revealing author identities part of the way through the process, and conclude by advocating for the continued use of double-blind review.