SEApr 27, 2019Code
Enabling Open-World Specification Mining via Unsupervised LearningJordan Henkel, Shuvendu K. Lahiri, Ben Liblit et al.
Many programming tasks require using both domain-specific code and well-established patterns (such as routines concerned with file IO). Together, several small patterns combine to create complex interactions. This compounding effect, mixed with domain-specific idiosyncrasies, creates a challenging environment for fully automatic specification inference. Mining specifications in this environment, without the aid of rule templates, user-directed feedback, or predefined API surfaces, is a major challenge. We call this challenge Open-World Specification Mining. In this paper, we present a framework for mining specifications and usage patterns in an Open-World setting. We design this framework to be miner-agnostic and instead focus on disentangling complex and noisy API interactions. To evaluate our framework, we introduce a benchmark of 71 clusters extracted from five open-source projects. Using this dataset, we show that interesting clusters can be recovered, in a fully automatic way, by leveraging unsupervised learning in the form of word embeddings. Once clusters have been recovered, the challenge of Open-World Specification Mining is simplified and any trace-based mining technique can be applied. In addition, we provide a comprehensive evaluation of three word-vector learners to showcase the value of sub-word information for embeddings learned in the software-engineering domain.
SEApr 29, 2020
TOFU: Target-Oriented FUzzerZi Wang, Ben Liblit, Thomas Reps
Program fuzzing---providing randomly constructed inputs to a computer program---has proved to be a powerful way to uncover bugs, find security vulnerabilities, and generate test inputs that increase code coverage. In many applications, however, one is interested in a target-oriented approach-one wants to find an input that causes the program to reach a specific target point in the program. We have created TOFU (for Target-Oriented FUzzer) to address the directed fuzzing problem. TOFU's search is biased according to a distance metric that scores each input according to how close the input's execution trace gets to the target locations. TOFU is also input-structure aware (i.e., the search makes use of a specification of a superset of the program's allowed inputs). Our experiments on xmllint show that TOFU is 28% faster than AFLGo, while reaching 45% more targets. Moreover, both distance-guided search and exploitation of knowledge of the input structure contribute significantly to TOFU's performance.
SEMar 18, 2018
Code Vectors: Understanding Programs Through Embedded Abstracted Symbolic TracesJordan Henkel, Shuvendu K. Lahiri, Ben Liblit et al.
With the rise of machine learning, there is a great deal of interest in treating programs as data to be fed to learning algorithms. However, programs do not start off in a form that is immediately amenable to most off-the-shelf learning techniques. Instead, it is necessary to transform the program to a suitable representation before a learning technique can be applied. In this paper, we use abstractions of traces obtained from symbolic execution of a program as a representation for learning word embeddings. We trained a variety of word embeddings under hundreds of parameterizations, and evaluated each learned embedding on a suite of different tasks. In our evaluation, we obtain 93% top-1 accuracy on a benchmark consisting of over 19,000 API-usage analogies extracted from the Linux kernel. In addition, we show that embeddings learned from (mainly) semantic abstractions provide nearly triple the accuracy of those learned from (mainly) syntactic abstractions.
SEJun 8, 2017
Source Forager: A Search Engine for Similar Source CodeVineeth Kashyap, David Bingham Brown, Ben Liblit et al.
Developers spend a significant amount of time searching for code: e.g., to understand how to complete, correct, or adapt their own code for a new context. Unfortunately, the state of the art in code search has not evolved much beyond text search over tokenized source. Code has much richer structure and semantics than normal text, and this property can be exploited to specialize the code-search process for better querying, searching, and ranking of code-search results. We present a new code-search engine named Source Forager. Given a query in the form of a C/C++ function, Source Forager searches a pre-populated code database for similar C/C++ functions. Source Forager preprocesses the database to extract a variety of simple code features that capture different aspects of code. A search returns the $k$ functions in the database that are most similar to the query, based on the various extracted code features. We tested the usefulness of Source Forager using a variety of code-search queries from two domains. Our experiments show that the ranked results returned by Source Forager are accurate, and that query-relevant functions can be reliably retrieved even when searching through a large code database that contains very few query-relevant functions. We believe that Source Forager is a first step towards much-needed tools that provide a better code-search experience.