SEJul 17, 2023
Using an LLM to Help With Code UnderstandingDaye Nam, Andrew Macvean, Vincent Hellendoorn et al.
Understanding code is challenging, especially when working in new and complex development environments. Code comments and documentation can help, but are typically scarce or hard to navigate. Large language models (LLMs) are revolutionizing the process of writing code. Can they do the same for helping understand it? In this study, we provide a first investigation of an LLM-based conversational UI built directly in the IDE that is geared towards code understanding. Our IDE plugin queries OpenAI's GPT-3.5-turbo model with four high-level requests without the user having to write explicit prompts: to explain a highlighted section of code, provide details of API calls used in the code, explain key domain-specific terms, and provide usage examples for an API. The plugin also allows for open-ended prompts, which are automatically contextualized to the LLM with the program being edited. We evaluate this system in a user study with 32 participants, which confirms that using our plugin can aid task completion more than web search. We additionally provide a thorough analysis of the ways developers use, and perceive the usefulness of, our system, among others finding that the usage and benefits differ between students and professionals. We conclude that in-IDE prompt-less interaction with LLMs is a promising future direction for tool builders.
CLApr 20, 2020Code
Incorporating External Knowledge through Pre-training for Natural Language to Code GenerationFrank F. Xu, Zhengbao Jiang, Pengcheng Yin et al.
Open-domain code generation aims to generate code in a general-purpose programming language (such as Python) from natural language (NL) intents. Motivated by the intuition that developers usually retrieve resources on the web when writing code, we explore the effectiveness of incorporating two varieties of external knowledge into NL-to-code generation: automatically mined NL-code pairs from the online programming QA forum StackOverflow and programming language API documentation. Our evaluations show that combining the two sources with data augmentation and retrieval-based data re-sampling improves the current state-of-the-art by up to 2.2% absolute BLEU score on the code generation testbed CoNaLa. The code and resources are available at https://github.com/neulab/external-knowledge-codegen.
SEMar 15, 2019Code
BugSwarm: Mining and Continuously Growing a Dataset of Reproducible Failures and FixesDavid A. Tomassi, Naji Dmeiri, Yichen Wang et al.
Fault-detection, localization, and repair methods are vital to software quality; but it is difficult to evaluate their generality, applicability, and current effectiveness. Large, diverse, realistic datasets of durably-reproducible faults and fixes are vital to good experimental evaluation of approaches to software quality, but they are difficult and expensive to assemble and keep current. Modern continuous-integration (CI) approaches, like Travis-CI, which are widely used, fully configurable, and executed within custom-built containers, promise a path toward much larger defect datasets. If we can identify and archive failing and subsequent passing runs, the containers will provide a substantial assurance of durable future reproducibility of build and test. Several obstacles, however, must be overcome to make this a practical reality. We describe BugSwarm, a toolset that navigates these obstacles to enable the creation of a scalable, diverse, realistic, continuously growing set of durably reproducible failing and passing versions of real-world, open-source systems. The BugSwarm toolkit has already gathered 3,091 fail-pass pairs, in Java and Python, all packaged within fully reproducible containers. Furthermore, the toolkit can be run periodically to detect fail-pass activities, thus growing the dataset continually.
SEMar 3, 2018Code
On Developers' Personality in Large-scale Distributed Projects: The Case of the Apache EcosystemFabio Calefato, Giuseppe Iaffaldano, Filippo Lanubile et al.
Large-scale distributed projects are typically the results of collective efforts performed by multiple developers, each one having a different personality. The study of developers' personalities has the potential of explaining their' behavior in various contexts. For example, the propensity to trust others, a critical factor to the success of global software engineering - has been found to influence positively the result of code reviews in distributed projects. In this paper, we perform a quantitative analysis of developers' personality in open source software projects, intended as an extreme form of distributed projects in which no single organization controls the project. We mine ecosystem-level data from the code commits and email messages contributed by the developers working on the Apache Software Foundation (ASF) projects, as representative of large scale-distributed projects. We find that developers become over time more conscientious, agreeable, and neurotic. Moreover, personality traits do not vary with their role, membership, and extent of contribution to the projects. We also find evidence that more open and more agreeable developers are more likely to become project contributors.
SEDec 6, 2015Code
Continuous integration in a social-coding world: Empirical evidence from GitHub. **Updated version with corrections**Bogdan Vasilescu, Stef van Schuylenburg, Jules Wulms et al.
Continuous integration is a software engineering practice of frequently merging all developer working copies with a shared main branch, e.g., several times a day. With the advent of GitHub, a platform well known for its "social coding" features that aid collaboration and sharing, and currently the largest code host in the open source world, collaborative software development has never been more prominent. In GitHub development one can distinguish between two types of developer contributions to a project: direct ones, coming from a typically small group of developers with write access to the main project repository, and indirect ones, coming from developers who fork the main repository, update their copies locally, and submit pull requests for review and merger. In this paper we explore how GitHub developers use continuous integration as well as whether the contribution type (direct versus indirect) and different project characteristics (e.g., main programming language, or project age) are associated with the success of the automatic builds.
SENov 6, 2025
Does AI-Assisted Coding Deliver? A Difference-in-Differences Study of Cursor's Impact on Software ProjectsHao He, Courtney Miller, Shyam Agarwal et al.
Large language models (LLMs) have demonstrated the promise to revolutionize the field of software engineering. Among other things, LLM agents are rapidly gaining momentum in their application to software development, with practitioners claiming a multifold productivity increase after adoption. Yet, empirical evidence is lacking around these claims. In this paper, we estimate the causal effect of adopting a widely popular LLM agent assistant, namely Cursor, on development velocity and software quality. The estimation is enabled by a state-of-the-art difference-in-differences design comparing Cursor-adopting GitHub projects with a matched control group of similar GitHub projects that do not use Cursor. We find that the adoption of Cursor leads to a significant, large, but transient increase in project-level development velocity, along with a significant and persistent increase in static analysis warnings and code complexity. Further panel generalized method of moments estimation reveals that the increase in static analysis warnings and code complexity acts as a major factor causing long-term velocity slowdown. Our study carries implications for software engineering practitioners, LLM agent assistant designers, and researchers.
SEDec 5, 2021
VarCLR: Variable Semantic Representation Pre-training via Contrastive LearningQibin Chen, Jeremy Lacomis, Edward J. Schwartz et al.
Variable names are critical for conveying intended program behavior. Machine learning-based program analysis methods use variable name representations for a wide range of tasks, such as suggesting new variable names and bug detection. Ideally, such methods could capture semantic relationships between names beyond syntactic similarity, e.g., the fact that the names average and mean are similar. Unfortunately, previous work has found that even the best of previous representation approaches primarily capture relatedness (whether two variables are linked at all), rather than similarity (whether they actually have the same meaning). We propose VarCLR, a new approach for learning semantic representations of variable names that effectively captures variable similarity in this stricter sense. We observe that this problem is an excellent fit for contrastive learning, which aims to minimize the distance between explicitly similar inputs, while maximizing the distance between dissimilar inputs. This requires labeled training data, and thus we construct a novel, weakly-supervised variable renaming dataset mined from GitHub edits. We show that VarCLR enables the effective application of sophisticated, general-purpose language models like BERT, to variable name representation and thus also to related downstream tasks like variable name similarity search or spelling correction. VarCLR produces models that significantly outperform the state-of-the-art on IdBench, an existing benchmark that explicitly captures variable similarity (as distinct from relatedness). Finally, we contribute a release of all data, code, and pre-trained models, aiming to provide a drop-in replacement for variable representations used in either existing or future program analyses that rely on variable names.
SEAug 13, 2021
Augmenting Decompiler Output with Learned Variable Names and TypesQibin Chen, Jeremy Lacomis, Edward J. Schwartz et al.
A common tool used by security professionals for reverse-engineering binaries found in the wild is the decompiler. A decompiler attempts to reverse compilation, transforming a binary to a higher-level language such as C. High-level languages ease reasoning about programs by providing useful abstractions such as loops, typed variables, and comments, but these abstractions are lost during compilation. Decompilers are able to deterministically reconstruct structural properties of code, but comments, variable names, and custom variable types are technically impossible to recover. In this paper we present DIRTY (DecompIled variable ReTYper), a novel technique for improving the quality of decompiler output that automatically generates meaningful variable names and types. Empirical evaluation on a novel dataset of C code mined from GitHub shows that DIRTY outperforms prior work approaches by a sizable margin, recovering the original names written by developers 66.4% of the time and the original types 75.8% of the time.
SEJan 27, 2021
In-IDE Code Generation from Natural Language: Promise and ChallengesFrank F. Xu, Bogdan Vasilescu, Graham Neubig
A great part of software development involves conceptualizing or communicating the underlying procedures and logic that needs to be expressed in programs. One major difficulty of programming is turning concept into code, especially when dealing with the APIs of unfamiliar libraries. Recently, there has been a proliferation of machine learning methods for code generation and retrieval from natural language queries, but these have primarily been evaluated purely based on retrieval accuracy or overlap of generated code with developer-written code, and the actual effect of these methods on the developer workflow is surprisingly unattested. We perform the first comprehensive investigation of the promise and challenges of using such technology inside the IDE, asking "at the current state of technology does it improve developer productivity or accuracy, how does it affect the developer experience, and what are the remaining gaps and challenges?" We first develop a plugin for the IDE that implements a hybrid of code generation and code retrieval functionality, and orchestrate virtual environments to enable collection of many user events. We ask developers with various backgrounds to complete 14 Python programming tasks ranging from basic file manipulation to machine learning or data visualization, with or without the help of the plugin. While qualitative surveys of developer experience are largely positive, quantitative results with regards to increased productivity, code quality, or program correctness are inconclusive. Analysis identifies several pain points that could improve the effectiveness of future machine learning based code generation/retrieval developer assistants, and demonstrates when developers prefer code generation over code retrieval and vice versa. We release all data and software to pave the road for future empirical studies and development of better models.
SEJun 22, 2020
Multitasking Across Industry Projects: A Replication StudyKarina Kohl, Bogdan Vasilescu, Rafael Prikladnicki
Background: Multitasking is usual in software development. It is the ability to stop working on a task, switch to another, and return eventually to the first one, as needed or as scheduled. Multitasking, however, comes at a cognitive cost: frequent context-switches can lead to distraction, sub-standard work, and even greater stress. Aims: This paper reports a replication experiment where we gathered data on a group of developers from a software development company from industry on a large collection of projects stored in GitLab repositories. Method: We reused the developed models and methods from the original study for measuring the rate and breadth of a developers' context-switching behavior, and we study how context-switching affects their productivity. We applied semi-structured interviews, replacing the original survey, to some of the developers to understand the reasons for and perceptions of multitasking. Results: We found out that industry developers multitask as much as OSS developers focusing more (on fewer projects), and working more repetitively from one day to the next is associated with higher productivity, but there is no effect for higher multitasking. Some commons reasons make them multitask: dependencies, personal interests, and social relationships. Conclusions: Short context change, less than three minutes, did not impact results from industry developers; however, more than that, it brings a feeling of left the previous tasks behind. So, it is proportional to how much context is switched: as bigger the context and bigger the interruption, it is worst to come back.
SEMar 17, 2020
An Exploratory Study of Bot CommitsTapajit Dey, Bogdan Vasilescu, Audris Mockus
Background: Bots help automate many of the tasks performed by software developers and are widely used to commit code in various social coding platforms. At present, it is not clear what types of activities these bots perform and understanding it may help design better bots, and find application areas which might benefit from bot adoption. Aim: We aim to categorize the Bot Commits by the type of change (files added, deleted, or modified), find the more commonly changed file types, and identify the groups of file types that tend to get updated together. Method: 12,326,137 commits made by 461 popular bots (that made at least 1000 commits) were examined to identify the frequency and the type of files added/ deleted/ modified by the commits, and association rule mining was used to identify the types of files modified together. Result: Majority of the bot commits modify an existing file, a few of them add new files, while deletion of a file is very rare. Commits involving more than one type of operation are even rarer. Files containing data, configuration, and documentation are most frequently updated, while HTML is the most common type in terms of the number of files added, deleted, and modified. Files of the type "Markdown", "Ignore List", "YAML", "JSON" were the types that are updated together with other types of files most frequently. Conclusion: We observe that majority of bot commits involve single file modifications, and bots primarily work with data, configuration, and documentation files. A better understanding if this is a limitation of the bots and, if overcome, would lead to different kinds of bots remains an open question.
SEMar 2, 2020
Detecting and Characterizing Bots that Commit CodeTapajit Dey, Sara Mousavi, Eduardo Ponce et al.
Background: Some developer activity traditionally performed manually, such as making code commits, opening, managing, or closing issues is increasingly subject to automation in many OSS projects. Specifically, such activity is often performed by tools that react to events or run at specific times. We refer to such automation tools as bots and, in many software mining scenarios related to developer productivity or code quality it is desirable to identify bots in order to separate their actions from actions of individuals. Aim: Find an automated way of identifying bots and code committed by these bots, and to characterize the types of bots based on their activity patterns. Method and Result: We propose BIMAN, a systematic approach to detect bots using author names, commit messages, files modified by the commit, and projects associated with the ommits. For our test data, the value for AUC-ROC was 0.9. We also characterized these bots based on the time patterns of their code commits and the types of files modified, and found that they primarily work with documentation files and web pages, and these files are most prevalent in HTML and JavaScript ecosystems. We have compiled a shareable dataset containing detailed information about 461 bots we found (all of whom have more than 1000 commits) and 13,762,430 commits they created.
SESep 19, 2019
DIRE: A Neural Approach to Decompiled Identifier NamingJeremy Lacomis, Pengcheng Yin, Edward J. Schwartz et al.
The decompiler is one of the most common tools for examining binaries without corresponding source code. It transforms binaries into high-level code, reversing the compilation process. Decompilers can reconstruct much of the information that is lost during the compilation process (e.g., structure and type information). Unfortunately, they do not reconstruct semantically meaningful variable names, which are known to increase code understandability. We propose the Decompiled Identifier Renaming Engine (DIRE), a novel probabilistic technique for variable name recovery that uses both lexical and structural information recovered by the decompiler. We also present a technique for generating corpora suitable for training and evaluating models of decompiled code renaming, which we use to create a corpus of 164,632 unique x86-64 binaries generated from C projects mined from GitHub. Our results show that on this corpus DIRE can predict variable names identical to the names in the original source code up to 74.3% of the time.
SEMay 30, 2019
A large-scale, in-depth analysis of developers' personalities in the Apache ecosystemFabio Calefato, Filippo Lanubile, Bogdan Vasilescu
Context: Large-scale distributed projects are typically the results of collective efforts performed by multiple developers with heterogeneous personalities. Objective: We aim to find evidence that personalities can explain developers' behavior in large scale-distributed projects. For example, the propensity to trust others - a critical factor for the success of global software engineering - has been found to influence positively the result of code reviews in distributed projects. Method: In this paper, we perform a quantitative analysis of ecosystem-level data from the code commits and email messages contributed by the developers working on the Apache Software Foundation (ASF) projects, as representative of large scale-distributed projects. Results: We find that there are three common types of personality profiles among Apache developers, characterized in particular by their level of Agreeableness and Neuroticism. We also confirm that developers' personality is stable over time. Moreover, personality traits do not vary with their role, membership, and extent of contribution to the projects. We also find evidence that more open developers are more likely to make contributors to Apache projects. Conclusion: Overall, our findings reinforce the need for future studies on human factors in software engineering to use psychometric tools to control for differences in developers' personalities.
CLMay 23, 2018
Learning to Mine Aligned Code and Natural Language Pairs from Stack OverflowPengcheng Yin, Bowen Deng, Edgar Chen et al.
For tasks like code synthesis from natural language, code retrieval, and code summarization, data-driven models have shown great promise. However, creating these models require parallel data between natural language (NL) and code with fine-grained alignments. Stack Overflow (SO) is a promising source to create such a data set: the questions are diverse and most of them have corresponding answers with high-quality code snippets. However, existing heuristic methods (e.g., pairing the title of a post with the code in the accepted answer) are limited both in their coverage and the correctness of the NL-code pairs obtained. In this paper, we propose a novel method to mine high-quality aligned data from SO using two sets of features: hand-crafted features considering the structure of the extracted snippets, and correspondence features obtained by training a probabilistic model to capture the correlation between NL and code using neural networks. These features are fed into a classifier that determines the quality of mined NL-code pairs. Experiments using Python and Java as test beds show that the proposed method greatly expands coverage and accuracy over existing mining methods, even when using only a small number of labeled examples. Further, we find that reasonable results are achieved even when training the classifier on one language and testing on another, showing promise for scaling NL-code mining to a wide variety of programming languages beyond those for which we are able to annotate data.
SEJun 2, 2016
Initial and Eventual Software Quality Relating to Continuous Integration in GitHubYue Yu, Bogdan Vasilescu, Huaimin Wang et al.
The constant demand for new features and bug fixes are forcing software projects to shorten cycles and deliver updates ever faster, while sustaining software quality. The availability of inexpensive, virtualized, cloud-computing has helped shorten schedules, by enabling continuous integration (CI) on demand. Platforms like GitHub support CI in-the-cloud. In projects using CI, a user submitting a pull request triggers a CI step. Besides speeding up build and test, this fortuitously creates voluminous archives of build and test successes and failures. CI is a relatively new phenomenon, and these archives allow a detailed study of CI. How many problems are exposed? Where do they occur? What factors affect CI failures? Does the "initial quality" as ascertained by CI predict how many bugs will later appear ("eventual quality") in the code? In this paper, we undertake a large-scale, fine resolution study of these records, to better understand CI processes, the nature, and predictors of CI failures, and the relationship of CI failures to the eventual quality of the code. We find that: a) CI failures appear to be concentrated in a few files, just like normal bugs; b) CI failures are not very highly correlated with eventual failures; c) The use of CI in a pull request doesn't necessarily mean the code in that request is of good quality.