69.2 | SE | May 15 | Code
AI Policy, Disclosure, and Human in the Loop: How Are Contribution Guidelines Adapting to GenAI?
Andre Hora, Romain Robbes
Generative AI (GenAI) has recently transformed software development. Due to the ease of generating code, open source projects are experiencing a growth in contributions. To address the rise of GenAI, open source projects have begun implementing policies for AI usage in contributions. However, the extent to which open source projects specify whether AI-assisted contributions are allowed or prohibited, along with the best practices for contributors, remains unclear. This paper provides an initial empirical study to explore how open source projects are adapting to GenAI contributions. We analyzed 1,000 popular GitHub repositories and identified 118 AI policies for contributors. Our results show that (1) 78% of the AI policies allow contributions generated with GenAI, while 22% explicitly discourage their use; (2) 51% of the AI policies require the disclosure of AI-generated contributions; and (3) 74% of the AI policies require a human in the loop during contribution. Overall, we find that the majority of the analyzed AI policies are positive regarding the usage of GenAI. However, AI disclosure and human in the loop are fundamental in the contribution process. Finally, we conclude by discussing implications for developers and researchers.
42.4 | SE | May 15 | Code
What's Inside a GitHub Repository? An Empirical Study on the Contents of 10K Projects
Andre Hora, João Eduardo Montandon, Diego Elias Costa
GitHub is the largest code hosting platform, with millions of repositories spanning multiple technologies. Despite this, little is known about the actual contents of GitHub's repositories in the wild. This paper presents an initial empirical analysis to better understand the contents of real-world GitHub repositories. We analyze the files, directories, and extensions present in 10,000 GitHub repositories, as well as their evolution over ten years. Our results show major changes in GitHub over the last decade: (1) the consolidation of README.md, .gitignore, and LICENSE as standard artifacts; (2) the rise of GitHub Actions as the dominant CI/CD platform; (3) the growth of configuration formats such as TOML, YAML, and JSON, alongside a decline in XML; (4) new trends, such as the growth of Dockerfile; and (5) emerging content related to LLMs and generative AI (e.g., AGENTS.md). Based on our findings, we discuss implications, including that open source is not only evolving organically but also increasingly guided by GitHub's standards, the rise and fall of technologies, and the potential support for mining software repository studies.
70.7 | SE | Apr 8
Agentic Much? Adoption of Coding Agents on GitHub
Romain Robbes, Théo Matricon, Thomas Degueule et al.
In the first half of 2025, coding agents emerged as a category of development tools that very quickly transitioned into practice. Unlike "traditional" code completion LLMs such as Copilot, agents like Cursor, Claude Code, or Codex operate with high degrees of autonomy, up to generating complete pull requests starting from a developer-provided task description. This new mode of operation is poised to change the landscape in an even larger way than code completion LLMs did, making the need to study their impact critical. Also, unlike traditional LLMs, coding agents tend to leave more explicit traces in software engineering artifacts, such as co-authoring commits or pull requests. We leverage these traces to present the first large-scale study (128,018 projects) of the adoption of coding agents on GitHub, finding an estimated adoption rate of 22.20% to 28.66%, which is very high for a technology only a few months old, and increasing. We carry out an in-depth study of the adopters we identified, finding that adoption is broad: it spans the entire spectrum of project maturity; it includes established organizations; and it concerns diverse programming languages or project topics. At the commit level, we find that commits assisted by coding agents are larger than commits only authored by human developers, and have a large proportion of features and bug fixes. These findings highlight the need for further investigation into the practical use of coding agents.
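The idea of spotting agent traces in commit metadata can be illustrated with a minimal Python sketch. It is not the detection method used in the study: the marker list `AGENT_MARKERS`, the function name `agent_coauthors`, and the sample commit message are all hypothetical, and the sketch only inspects `Co-authored-by:` trailers, which is one of several traces an agent might leave.

```python
import re

# Hypothetical marker list: substrings an agent might leave in a co-author trailer.
# A real study would need a curated, validated list per tool.
AGENT_MARKERS = ("claude", "cursor", "codex")

# Matches "Co-authored-by: <value>" trailer lines, case-insensitively.
TRAILER_RE = re.compile(r"^co-authored-by:\s*(.+)$", re.IGNORECASE | re.MULTILINE)


def agent_coauthors(commit_message: str) -> list[str]:
    """Return the agent markers found in Co-authored-by trailers of a commit message."""
    found = []
    for value in TRAILER_RE.findall(commit_message):
        lowered = value.lower()
        for marker in AGENT_MARKERS:
            if marker in lowered and marker not in found:
                found.append(marker)
    return found


# Made-up commit message for illustration only.
message = (
    "Fix parser crash on empty input\n"
    "\n"
    "Co-authored-by: Claude <agent@example.com>\n"
)
print(agent_coauthors(message))
```

Substring matching of this kind is deliberately crude; a human co-author whose name contains a marker would be a false positive, which is one reason an adoption estimate is naturally reported as a range.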
SE | Jan 12, 2022 | Code
Towards a Catalog of Composite Refactorings
Aline Brito, Andre Hora, Marco Tulio Valente
Catalogs of refactoring have key importance in software maintenance and evolution, since developers rely on such documents to understand and perform refactoring operations. Furthermore, these catalogs constitute a reference guide for communication between practitioners since they standardize a common refactoring vocabulary. Fowler's book describes the most popular catalog of refactorings, which documents single and well-known refactoring operations. However, sometimes refactorings are composite transformations, i.e., a sequence of refactorings is performed over a given program element. For example, a sequence of Extract Method operations (a single refactoring) can be performed over the same method, in one or in multiple commits, to simplify its implementation, thereby leading to a Method Decomposition operation (a composite refactoring). In this paper, we propose and document a catalog with eight composite refactorings. We also implement a set of scripts to mine composite refactorings by preprocessing the results of refactoring detection tools. Using such scripts, we search for composites in a representative refactoring oracle with hundreds of confirmed single refactoring operations. Next, to complement this first study, we also search for composites in the full history of ten well-known open-source projects. We characterize the detected composite refactorings, under dimensions such as size and location. We conclude by addressing the applications and implications of the proposed catalog.
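The Method Decomposition example above (several Extract Method operations applied to the same method) can be sketched as a simple grouping step over the output of a refactoring detection tool. This is an illustrative assumption, not the authors' scripts: the tuple layout `(kind, source_method, commit)`, the function name, the `min_ops` threshold, and the sample data are all hypothetical.

```python
from collections import defaultdict


def method_decomposition_candidates(refactorings, min_ops=2):
    """Group Extract Method operations by source method.

    `refactorings` is an iterable of (kind, source_method, commit) tuples,
    a hypothetical flattened form of a detection tool's output.
    Returns the methods with at least `min_ops` extractions, mapped to
    the commits in which those extractions happened.
    """
    by_method = defaultdict(list)
    for kind, source_method, commit in refactorings:
        if kind == "Extract Method":
            by_method[source_method].append(commit)
    return {m: commits for m, commits in by_method.items() if len(commits) >= min_ops}


# Made-up detection results for illustration only.
detected = [
    ("Extract Method", "Parser.parse", "c1"),
    ("Rename Method", "Parser.reset", "c1"),
    ("Extract Method", "Parser.parse", "c2"),
    ("Extract Method", "Lexer.next", "c3"),
]
print(method_decomposition_candidates(detected))
```

Because the grouping key is the method rather than the commit, the same sketch covers composites that happen in one commit and those spread over several.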
SE | Mar 10, 2020 | Code
Refactoring Graphs: Assessing Refactoring over Time
Aline Brito, Andre Hora, Marco Tulio Valente
Refactoring is an essential activity during software evolution. Frequently, practitioners rely on such transformations to improve source code maintainability and quality. As a consequence, this process may produce new source code entities or change the structure of existing ones. Sometimes, the transformations are atomic, i.e., performed in a single commit. In other cases, they generate sequences of modifications performed over time. To study and reason about refactorings over time, in this paper, we propose a novel concept called refactoring graphs and provide an algorithm to build such graphs. Then, we investigate the history of 10 popular open-source Java-based projects. After eliminating trivial graphs, we characterize a large sample of 1,150 refactoring graphs, providing quantitative data on their size, commits, age, refactoring composition, and developers. We conclude by discussing applications and implications of refactoring graphs, for example, to improve code comprehension, detect refactoring patterns, and support software evolution studies.
SE | Mar 15, 2018 | Code
Why We Engage in FLOSS: Answers from Core Developers
Jailton Coelho, Marco Tulio Valente, Luciana L. Silva et al.
The maintenance and evolution of Free/Libre Open Source Software (FLOSS) projects demand the constant attraction of core developers. In this paper, we report the results of a survey with 52 developers, who recently became core contributors of popular GitHub projects. We reveal their motivations to assume a key role in FLOSS projects (e.g., improving the projects because they are also using them), the project characteristics that most helped in their engagement process (e.g., a friendly community), and the barriers faced by the surveyed core developers (e.g., lack of time of the project leaders). We also compare our results with related studies about other kinds of open source contributors (casual, one-time, and newcomers).
SE | Mar 8, 2017 | Code
Assessing Code Authorship: The Case of the Linux Kernel
Guilherme Avelino, Leonardo Passos, Andre Hora et al.
Code authorship is key information in large-scale open source systems. Among other things, it allows maintainers to assess division of work and identify key collaborators. Interestingly, open-source communities lack guidelines on how to manage authorship. This could be mitigated by setting out to build an empirical body of knowledge on how authorship-related measures evolve in successful open-source communities. Towards that direction, we perform a case study on the Linux kernel. Our results show that: (a) only a small portion of developers (26%) makes significant contributions to the code base; (b) the distribution of the number of files per author is highly skewed: a small group of top authors (3%) is responsible for hundreds of files, while most authors (75%) are responsible for at most 11 files; (c) most authors (62%) have a specialist profile; (d) authors with a high number of co-authorship connections tend to collaborate with others with fewer connections.
SE | Jul 14, 2016 | Code
Predicting the Popularity of GitHub Repositories
Hudson Borges, Andre Hora, Marco Tulio Valente
GitHub is the largest source code repository in the world. It provides a git-based source code management platform and also many features inspired by social networks. For example, GitHub users can show appreciation to projects by adding stars to them. Therefore, the number of stars of a repository is a direct measure of its popularity. In this paper, we use multiple linear regressions to predict the number of stars of GitHub repositories. These predictions are useful both to repository owners and clients, who usually want to know how their projects are performing in a competitive open source development market. In a large-scale analysis, we show that the proposed models start to provide accurate predictions after being trained with the number of stars received in the last six months. Furthermore, specific models (generated using data from repositories that share the same growth trends) are recommended for repositories with slow growth and/or for repositories with fewer stars. Finally, we evaluate the ability to predict not the number of stars of a repository but its rank among the GitHub repositories. We found a very strong correlation between predicted and real rankings (Spearman's rho greater than 0.95).
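The ranking evaluation mentioned above relies on Spearman's rho, which is the Pearson correlation computed on ranks instead of raw values. The following minimal Python sketch shows that computation with average ranks for ties; the function names and the star counts are invented for illustration and are not data from the paper.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up predicted and actual star counts for five repositories.
predicted = [120, 950, 400, 3100, 75]
actual = [150, 900, 380, 2800, 90]
print(spearman_rho(predicted, actual))
```

Because only the ordering matters, predictions that are off in absolute terms can still yield a high rho as long as repositories are ranked in the same order.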
SE | Jun 15, 2016 | Code
Understanding the Factors that Impact the Popularity of GitHub Repositories
Hudson Borges, Andre Hora, Marco Tulio Valente
Software popularity is valuable information to modern open source developers, who constantly want to know if their systems are attracting new users, if new releases are gaining acceptance, or if they are meeting users' expectations. In this paper, we describe a study on the popularity of software systems hosted at GitHub, which is the world's largest collection of open source software. GitHub provides an explicit way for users to manifest their satisfaction with a hosted repository: the stargazers button. In our study, we reveal the main factors that impact the number of stars of GitHub projects, including programming language and application domain. We also study the impact of new features on project popularity. Finally, we identify four main patterns of popularity growth, which are derived after clustering the time series representing the number of stars of 2,279 popular GitHub repositories. We hope our results provide valuable insights to developers and maintainers, which can help them in building and evolving systems in a competitive software market.
SE | Jul 2, 2015 | Code
On the Popularity of GitHub Applications: A Preliminary Note
Hudson Borges, Marco Tulio Valente, Andre Hora et al.
GitHub is the world's largest collection of open source software. Therefore, it is important both to software developers and users to compare and track the popularity of GitHub repositories. In this paper, we propose a framework to assess the popularity of GitHub software, using their number of stars. We also propose a set of popularity growth patterns, which describe the evolution of the number of stars of a system over time. We show that stars tend to correlate with other measures, like forks, and with the effective usage of GitHub software by third-party programs. Throughout the paper we illustrate the application of our framework using real data extracted from GitHub.
SE | Jul 12, 2019
Framework Code Samples: How Are They Maintained and Used by Developers?
Gabriel Menezes, Bruno Cafeo, Andre Hora
Background: Modern software systems are commonly built on top of frameworks. To accelerate the learning process of features provided by frameworks, code samples are made available to assist developers. However, we know little about how code samples are actually developed. Aims: In this paper, we aim to fill this gap by assessing the characteristics of framework code samples. We provide insights on how code samples are maintained and used by developers. Method: We analyze 233 code samples of Android and SpringBoot, and assess aspects related to their source code, evolution, popularity, and client usage. Results: We find that most code samples are small and simple, provide a working environment to the clients, and rely on automated build tools. They change frequently over time, for example, to adapt to new framework versions. We also detect that clients commonly fork the code samples but rarely modify them. Conclusions: We provide a set of lessons learned and implications to creators and clients of code samples to improve maintenance and usage activities.
SE | Jan 16, 2018
Why and How Java Developers Break APIs
Aline Brito, Laerte Xavier, Andre Hora et al.
Modern software development depends on APIs to reuse code and increase productivity. Like most software systems, these libraries and frameworks also evolve, which may break existing clients. However, the main reasons to introduce breaking changes in APIs are unclear. Therefore, in this paper, we report the results of an almost four-month-long field study with the developers of 400 popular Java libraries and frameworks. We configured an infrastructure to observe all changes in these libraries and to detect breaking changes shortly after their introduction in the code. After identifying breaking changes, we asked the developers to explain the reasons behind their decision to change the APIs. During the study, we identified 59 breaking changes, confirmed by the developers of 19 projects. By analyzing the developers' answers, we report that breaking changes are mostly motivated by the need to implement new features, by the desire to make the APIs simpler and with fewer elements, and by the need to improve maintainability. We conclude by providing suggestions to language designers, tool builders, software engineering researchers and API developers.
SE | May 15, 2017
CodeCity for (and by) JavaScript
Marcos Viana, Andre Hora, Marco Tulio Valente
JavaScript is one of the most popular programming languages on the web. Despite the language's popularity and the increasing size of JavaScript systems, there is a limited number of visualization tools that can be used by developers to comprehend, maintain, and evolve JavaScript software. In this paper, we introduce JSCity, an implementation in JavaScript of the well-known Code City software visualization metaphor. JSCity relies on JavaScript features and libraries to show "software cities" in standard web browsers, without requiring complex installation procedures. We also report our experience on producing visualizations for 40 popular JavaScript systems using JSCity.
SE | Apr 22, 2016
A Novel Approach for Estimating Truck Factors
Guilherme Avelino, Leonardo Passos, Andre Hora et al.
Truck Factor (TF) is a metric proposed by the agile community as a tool to identify concentration of knowledge in software development environments. It states the minimal number of developers that have to be hit by a truck (or quit) before a project is incapacitated. In other words, TF helps to measure how prepared a project is to deal with developer turnover. Despite its clear relevance, few studies explore this metric. Moreover, there is no consensus about how to calculate it, and no supporting evidence backing estimates for systems in the wild. To mitigate both issues, we propose a novel (and automated) approach for estimating TF-values, which we execute against a corpus of 133 popular projects on GitHub. We later survey developers as a means to assess the reliability of our results. Among other findings, we find that the majority of our target systems (65%) have TF <= 2. Surveying developers from 67 target systems provides confidence towards our estimates; in 84% of the valid answers we collect, developers agree or partially agree that the TF's authors are the main authors of their systems; in 53% we receive a positive or partially positive answer regarding our estimated truck factors.
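To make the definition of Truck Factor concrete, here is a minimal Python sketch of one possible greedy estimate. The abstract does not describe the authors' algorithm, so everything here is an assumption: the input is a hypothetical mapping from file to its set of authors, the project is treated as "incapacitated" once more than `threshold` of its files have no remaining author, and the authors removed first are those covering the most files.

```python
from collections import Counter


def truck_factor(file_authors, threshold=0.5):
    """Greedy truck factor estimate (illustrative heuristic, not the paper's method).

    Repeatedly removes the author who covers the most files until more than
    `threshold` of the files have no remaining author. Returns the number of
    removed authors and their names in removal order.
    """
    remaining = {f: set(authors) for f, authors in file_authors.items()}
    total = len(remaining)
    removed = []
    while total and sum(1 for a in remaining.values() if not a) / total <= threshold:
        counts = Counter(author for authors in remaining.values() for author in authors)
        if not counts:
            break  # no authors left to remove
        # Pick the author covering the most files; ties are broken by name
        # so the result is deterministic.
        top, _ = max(counts.items(), key=lambda kv: (kv[1], kv[0]))
        removed.append(top)
        for authors in remaining.values():
            authors.discard(top)
    return len(removed), removed


# Made-up authorship data for illustration only.
concentrated = {"a.py": {"ana"}, "b.py": {"ana"}, "c.py": {"ana"}, "d.py": {"bob"}}
shared = {"a.py": {"ana"}, "b.py": {"ana"}, "c.py": {"ana", "bob"}, "d.py": {"bob"}}
print(truck_factor(concentrated))
print(truck_factor(shared))
```

In the first dataset a single author covers three of four files, so losing that one person already orphans most of the project; in the second, shared authorship of one file means two authors must leave before the threshold is crossed.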