Christoph Treude

h-index45

88papers

2,012citations

Novelty30%

AI Score54

Ranked #26,505 of 200,471 authors (top 13%)#226 in SE (top 7%)

88 Papers

SEJun 1

Improving LLM-Based Go Code Review through Issue-List Generation and Context Augmentation

Kexin Sun, Yucong Guan, Jiaqi Sun et al.

LLMs have shown strong potential for automating code review, yet their practical utility depends heavily on the design of generation and context strategies. In this paper, we investigate how to improve LLM-based code review through generation strategy and contextual augmentation. We first propose an issue-list review paradigm, in which LLMs enumerate all potential issues rather than reporting only the single most important one (i.e., primary-issue review). We then systematically compare three types of code context augmentation -- neighboring, LSP-based semantics, and IR-based similar co-change context -- and study how they influence issue discovery. Finally, we integrate candidates from no-context and context-enhanced generation to improve review coverage, and introduce refinement-guided pruning to keep the candidate list at a practical size. We evaluate our approach on 1,438 Go review instances using downstream code refinement as the main metric, i.e., how often the candidate list contains at least one comment inducing the same code change as the final human revision. For comparison, we evaluate comments by CodeReviewer, a model trained specifically for review comment generation, as well as ground-truth human review comments (as a practical upper bound), under the same refinement-based evaluation. The results show that our best configuration, combining issue-list review, neighboring and similar co-change context, and candidate integration, reaches 28.00% refinement exact match, a statistically significant gain of +10.85 percentage points over primary-issue review without any additional context (17.15%), substantially outperforming CodeReviewer (15.02%) and approaching the human-oracle ceiling of 36.09%. Our refinement-guided pruning reduces the average candidate count from 7.2 to 3.1 at top-5 while retaining nearly the full benefit, making the candidate list easier to inspect.

SEJul 30, 2022Code

Automatically Categorising GitHub Repositories by Application Domain

Francisco Zanartu, Christoph Treude, Bruno Cartaxo et al.

GitHub is the largest host of open source software on the Internet. This large, freely accessible database has attracted the attention of practitioners and researchers alike. But as GitHub's growth continues, it is becoming increasingly hard to navigate the plethora of repositories which span a wide range of domains. Past work has shown that taking the application domain into account is crucial for tasks such as predicting the popularity of a repository and reasoning about project quality. In this work, we build on a previously annotated dataset of 5,000 GitHub repositories to design an automated classifier for categorising repositories by their application domain. The classifier uses state-of-the-art natural language processing techniques and machine learning to learn from multiple data sources and catalogue repositories according to five application domains. We contribute with (1) an automated classifier that can assign popular repositories to each application domain with at least 70% precision, (2) an investigation of the approach's performance on less popular repositories, and (3) a practical application of this approach to answer how the adoption of software engineering practices differs across application domains. Our work aims to help the GitHub community identify repositories of interest and opens promising avenues for future work investigating differences between repositories from different application domains.

SEApr 5Code

Self-Admitted GenAI Usage in Open-Source Software

Tao Xiao, Youmei Fan, Fabio Calefato et al.

Strategized LaTeX removal and whitespace normalization approachThe widespread adoption of generative AI (GenAI) tools such as GitHub Copilot and ChatGPT is transforming software development. Since generated source code is virtually impossible to distinguish from manually written code, their real-world usage and impact on open-source software (OSS) development remain poorly understood. In this paper, we introduce the concept of self-admitted GenAI usage, that is, developers explicitly referring to the use of GenAI tools for content creation in software artifacts. Using this concept as a lens to study how GenAI tools are integrated into OSS projects, we analyze a curated sample of more than 200,000 GitHub repositories, identifying 1,292 such self-admissions across 156 repositories in commit messages, code comments, and project documentation. Using a mixed methods approach, we derive a taxonomy of 32 tasks, 10 content types, and 11 purposes associated with GenAI usage based on 1,292 qualitatively coded mentions. We then analyze 13 documents with policies and usage guidelines for GenAI tools and conduct a developer survey to uncover the ethical, legal, and practical concerns behind them. Our findings reveal that developers actively manage how GenAI is used in their projects, highlighting the need for project-level transparency, attribution, and quality control practices in AI-assisted software development. Finally, we examine the longitudinal impact of GenAI adoption on code churn in 151 repositories with self-admitted GenAI usage and find no general increase, contradicting popular narratives on the impact of GenAI on software development.

SEMay 10

Guidelines for Empirical Studies in Software Engineering involving Large Language Models

Sebastian Baltes, Florian Angermeir, Chetan Arora et al.

Large Language Models (LLMs) are widely used in software engineering (SE) research and practice, yet their non-determinism, opaque training data, and rapidly evolving models threaten the reproducibility and replicability of empirical studies. We address this challenge through a collaborative effort of 22 researchers, presenting a taxonomy of seven study types that organizes how LLMs are used in SE research, together with eight guidelines for designing and reporting such studies. Each guideline distinguishes requirements (must) from recommended practices (should) and is contextualized by the study types it applies to. Our guidelines recommend that researchers: (1) declare LLM usage and role; (2) report model versions, configurations, and customizations; (3) document the tool architecture beyond the model; (4) disclose prompts, their development, and interaction logs; (5) validate LLM outputs with humans; (6) include an open LLM as a baseline; (7) use suitable baselines, benchmarks, and metrics; and (8) articulate limitations and mitigations. We complement the guidelines with an applicability matrix mapping guidelines to study types and a reporting checklist for authors and reviewers. We maintain the study types and guidelines online as a living resource for the community to use and shape (llm-guidelines$.$org).

CLApr 15, 2022

Is Surprisal in Issue Trackers Actionable?

James Caddy, Markus Wagner, Christoph Treude et al. · cambridge, microsoft-research

Background. From information theory, surprisal is a measurement of how unexpected an event is. Statistical language models provide a probabilistic approximation of natural languages, and because surprisal is constructed with the probability of an event occuring, it is therefore possible to determine the surprisal associated with English sentences. The issues and pull requests of software repository issue trackers give insight into the development process and likely contain the surprising events of this process. Objective. Prior works have identified that unusual events in software repositories are of interest to developers, and use simple code metrics-based methods for detecting them. In this study we will propose a new method for unusual event detection in software repositories using surprisal. With the ability to find surprising issues and pull requests, we intend to further analyse them to determine if they actually hold importance in a repository, or if they pose a significant challenge to address. If it is possible to find bad surprises early, or before they cause additional troubles, it is plausible that effort, cost and time will be saved as a result. Method. After extracting the issues and pull requests from 5000 of the most popular software repositories on GitHub, we will train a language model to represent these issues. We will measure their perceived importance in the repository, measure their resolution difficulty using several analogues, measure the surprisal of each, and finally generate inferential statistics to describe any correlations.

SESep 17, 2024Code

Leveraging Reviewer Experience in Code Review Comment Generation

Hong Yi Lin, Patanamon Thongtanunam, Christoph Treude et al.

Modern code review is a ubiquitous software quality assurance process aimed at identifying potential issues within newly written code. Despite its effectiveness, the process demands large amounts of effort from the human reviewers involved. To help alleviate this workload, researchers have trained deep learning models to imitate human reviewers in providing natural language code reviews. Formally, this task is known as code review comment generation. Prior work has demonstrated improvements in this task by leveraging machine learning techniques and neural models, such as transfer learning and the transformer architecture. However, the quality of the model generated reviews remain sub-optimal due to the quality of the open-source code review data used in model training. This is in part due to the data obtained from open-source projects where code reviews are conducted in a public forum, and reviewers possess varying levels of software development experience, potentially affecting the quality of their feedback. To accommodate for this variation, we propose a suite of experience-aware training methods that utilise the reviewers' past authoring and reviewing experiences as signals for review quality. Specifically, we propose experience-aware loss functions (ELF), which use the reviewers' authoring and reviewing ownership of a project as weights in the model's loss function. Through this method, experienced reviewers' code reviews yield larger influence over the model's behaviour. Compared to the SOTA model, ELF was able to generate higher quality reviews in terms of accuracy, informativeness, and comment types generated. The key contribution of this work is the demonstration of how traditional software engineering concepts such as reviewer experience can be integrated into the design of AI-based automated code review models.

SEApr 20Code

When AI Models Become Dependencies: Studying the Evolution of Pre-Trained Model Reuse in Downstream Software Systems

Peerachai Banyongrakkul, Mansooreh Zahedi, Christoph Treude et al.

Modern software systems have transitioned from purely code-based architectures to AI-integrated systems where pre-trained models (PTMs) serve as permanent dependencies. However, while the evolution of traditional software libraries is well-documented, we lack a clear understanding of how these "PTM dependencies" change over time. Unlike libraries, PTMs are characterized by opaque internals and less standardized, rapidly evolving release cycles. Furthermore, their multi-role nature enables developers to treat individual instances of a single PTM as separate functional dependencies based on their specific downstream tasks. This raises a critical question for software maintenance: do PTMs change like standard software libraries or do they follow a divergent pattern? To answer this, we present the first empirical study of downstream PTM changes, analyzing a comprehensive dataset of 4,988 releases across 323 GitHub OSS repositories that reuse open-source PTMs. Using traditional software libraries as a baseline, we find that PTMs follow a qualitatively distinct pattern. PTMs are typically added late in the project life-cycle and tend to accumulate rather than be replaced as a project matures. Our findings show that PTM changes are three times less frequent (406 of 2,814 release transitions) than library changes. PTM changes are also less routinely documented, but more likely to carry explicit rationale. Unlike libraries, which evolve reactively, PTM evolution is proactively driven by capability expansion, with a unique documented rationale of PTM testing uncertainty. Our work calls for a rethinking of how PTMs are tracked and managed as dependencies in modern software engineering.

SEMay 21Code

Deterministic vs. Probabilistic Summarisation: An Empirical Trade-off Study in Design Pattern Centric Java Code

Najam Nazar, Christoph Treude

Background: Automated code summarisation supports program comprehension and documentation, yet the relative strengths and limitations of deterministic (heuristic-based) and probabilistic (LLM-based) pipelines remain unclear. Aims: This paper presents a controlled empirical comparison of these paradigms for intent-oriented design-pattern code summarisation. Method: Using design-pattern-centric Java code as a structured testbed (150 files from three open-source repositories covering nine patterns), we compare a rule-based natural language generation (NLG) pipeline, a Software Word Usage Model (SWUM)-based approach, and a probabilistic pipeline based on the Mixtral LLM. Summaries are evaluated against human references using BERTScore and cosine similarity, complemented by rubric-based judgements produced by Llama 3 across five dimensions: accuracy, conciseness, adequacy, code-context awareness, and design-pattern fidelity. Statistical analysis includes Wilcoxon signed-rank tests (with effect sizes), Friedman tests with post-hoc corrections, and Spearman correlation for sensitivity analysis of rubric consistency. Results: Probabilistic summaries show stronger semantic alignment and richer contextual coverage, while deterministic approaches produce more concise and fully reproducible outputs. Prompt-sensitivity and multi-run analyses indicate variability in LLM outputs, though relative trends remain stable. Conclusions: A clear trade-off emerges: probabilistic methods favour semantic depth and contextual accuracy, whereas deterministic pipelines are preferable for brevity and reproducibility. These findings provide practical guidance for selecting code summarisation techniques.

SEMar 17, 2023

She Elicits Requirements and He Tests: Software Engineering Gender Bias in Large Language Models

Christoph Treude, Hideaki Hata

Implicit gender bias in software development is a well-documented issue, such as the association of technical roles with men. To address this bias, it is important to understand it in more detail. This study uses data mining techniques to investigate the extent to which 56 tasks related to software development, such as assigning GitHub issues and testing, are affected by implicit gender bias embedded in large language models. We systematically translated each task from English into a genderless language and back, and investigated the pronouns associated with each task. Based on translating each task 100 times in different permutations, we identify a significant disparity in the gendered pronoun associations with different tasks. Specifically, requirements elicitation was associated with the pronoun "he" in only 6% of cases, while testing was associated with "he" in 100% of cases. Additionally, tasks related to helping others had a 91% association with "he" while the same association for tasks related to asking coworkers was only 52%. These findings reveal a clear pattern of gender bias related to software development tasks and have important implications for addressing this issue both in the training of large language models and in broader society.

SEMay 7

Operationalizing Ethics for AI Agents: How Developers Encode Values into Repository Context Files

Christoph Treude, Sebastian Baltes, Marc Cheong

As AI coding agents become embedded in software development workflows, developers are beginning to operationalize ethical principles by encoding behavioral rules into repository-level context files for AI agents, such as AGENTS.md files. Rather than examining the ethics of AI agents in the abstract, this vision paper investigates how ethics and values are already being translated for AI agents into actionable instructions that shape agent behavior. Through a preliminary investigation, we find that developers are already embedding guidance related to fairness, accessibility, sustainability, tone, and privacy. These artifacts function as a developer-authored governance layer, translating abstract principles into situated, natural-language directives within development workflows. We outline a research agenda for studying this emerging practice, including how encoded values vary across communities, what governance dynamics emerge when multiple contributors negotiate these files, and whether agents reliably adhere to the constraints specified. Understanding how ethics and values are operationalized for AI agents is essential to ground AI governance in modern software engineering practice.

SEMar 18, 2023

Stop Words for Processing Software Engineering Documents: Do they Matter?

Yaohou Fan, Chetan Arora, Christoph Treude

Stop words, which are considered non-predictive, are often eliminated in natural language processing tasks. However, the definition of uninformative vocabulary is vague, so most algorithms use general knowledge-based stop lists to remove stop words. There is an ongoing debate among academics about the usefulness of stop word elimination, especially in domain-specific settings. In this work, we investigate the usefulness of stop word removal in a software engineering context. To do this, we replicate and experiment with three software engineering research tools from related work. Additionally, we construct a corpus of software engineering domain-related text from 10,000 Stack Overflow questions and identify 200 domain-specific stop words using traditional information-theoretic methods. Our results show that the use of domain-specific stop words significantly improved the performance of research tools compared to the use of a general stop list and that 17 out of 19 evaluation measures showed better performance. Online appendix: https://zenodo.org/record/7865748

SEAug 10, 2024

Can LLMs Replace Manual Annotation of Software Engineering Artifacts?

Toufique Ahmed, Premkumar Devanbu, Christoph Treude et al.

Experimental evaluations of software engineering innovations, e.g., tools and processes, often include human-subject studies as a component of a multi-pronged strategy to obtain greater generalizability of the findings. However, human-subject studies in our field are challenging, due to the cost and difficulty of finding and employing suitable subjects, ideally, professional programmers with varying degrees of experience. Meanwhile, large language models (LLMs) have recently started to demonstrate human-level performance in several areas. This paper explores the possibility of substituting costly human subjects with much cheaper LLM queries in evaluations of code and code-related artifacts. We study this idea by applying six state-of-the-art LLMs to ten annotation tasks from five datasets created by prior work, such as judging the accuracy of a natural language summary of a method or deciding whether a code change fixes a static analysis warning. Our results show that replacing some human annotation effort with LLMs can produce inter-rater agreements equal or close to human-rater agreement. To help decide when and how to use LLMs in human-subject studies, we propose model-model agreement as a predictor of whether a given task is suitable for LLMs at all, and model confidence as a means to select specific samples where LLMs can safely replace human annotators. Overall, our work is the first step toward mixed human-LLM evaluations in software engineering.

SEJul 30, 2022

Adding Context to Source Code Representations for Deep Learning

Fuwei Tian, Christoph Treude

Deep learning models have been successfully applied to a variety of software engineering tasks, such as code classification, summarisation, and bug and vulnerability detection. In order to apply deep learning to these tasks, source code needs to be represented in a format that is suitable for input into the deep learning model. Most approaches to representing source code, such as tokens, abstract syntax trees (ASTs), data flow graphs (DFGs), and control flow graphs (CFGs) only focus on the code itself and do not take into account additional context that could be useful for deep learning models. In this paper, we argue that it is beneficial for deep learning models to have access to additional contextual information about the code being analysed. We present preliminary evidence that encoding context from the call hierarchy along with information from the code itself can improve the performance of a state-of-the-art deep learning model for two software engineering tasks. We outline our research agenda for adding further contextual information to source code representations for deep learning.

SEApr 25

Does AI Code Review Lead to Code Changes? A Case Study of GitHub Actions

Kexin Sun, Hongyu Kuang, Sebastian Baltes et al.

AI-based code review tools automatically review and comment on pull requests to improve code quality. Despite their growing presence, little is known about their actual impact. We present a large-scale empirical study of 16 popular AI-based code review actions for GitHub workflows, analyzing more than 22,000 review comments in 178 repositories. We investigate (1) how these tools are adopted and configured, (2) whether their comments lead to code changes, and (3) which factors influence their effectiveness. We develop a two-stage LLM-assisted framework to determine whether review comments are addressed, and use interpretable machine learning to identify influencing factors. Our findings show that, while adoption is growing, effectiveness varies widely. Comments that are concise, contain code snippets, and are manually triggered, particularly those from hunk-level review tools, are more likely to result in code changes. These results highlight the importance of careful tool design and suggest directions for improving AI-based code review systems.

SEApr 30

AI Failures in the Eyes of the Downstream Developer: A First Look at Concerns, Practices, and Challenges

Haoyu Gao, Mansooreh Zahedi, Wenxin Jiang et al.

With the advancement of AI models, more software systems are adopting AI as a component to facilitate automation. Pre-trained models (PTMs) have become a cornerstone of AI-based software, allowing for rapid integration and development with lower training cost. However, their adoption also introduces failure modes such as data leakage and biased outputs, that may require careful handling by downstream developers. While previous research has proposed taxonomies of these technical concerns and various mitigation strategies, how downstream developers address these issues during the development of general AI-based software when reusing PTMs remains unexplored. Understanding downstream developers' perspectives is essential because they directly influence how these potential failures concerns translate into practice, such as determining whether immediate risks like data leakage or model bias are recognised, mitigated, or inadvertently overlooked in real-world deployments. This study investigates downstream developers' concerns, practices and perceived challenges regarding practical AI failures during the development of AI-based software. To achieve this, we conducted a mixed-method study, including interviews with 16 participants, a survey of 86 practitioners,

SEApr 12

Towards an Appropriate Level of Reliance on AI: A Preliminary Reliance-Control Framework for AI in Software Engineering

Samuel Ferino, Rashina Hoda, John Grundy et al.

How software developers interact with Artificial Intelligence (AI)-powered tools, including Large Language Models (LLMs), plays a vital role in how these AI-powered tools impact them. While overreliance on AI may lead to long-term negative consequences (e.g., atrophy of critical thinking skills); underreliance might deprive software developers of potential gains in productivity and quality. Based on twenty-two interviews with software developers on using LLMs for software development, we propose a preliminary reliance-control framework where the level of control can be used as a way to identify AI overreliance and underreliance. We also use it to recommend future research to further explore the different control levels supported by the current and emergent LLM-driven tools. Our paper contributes to the emerging discourse on AI overreliance and provides an understanding of the appropriate degree of reliance as essential to developers making the most of these powerful technologies. Our findings can help practitioners, educators, and policymakers promote responsible and effective use of AI tools.

SEJul 16, 2025

From Release to Adoption: Challenges in Reusing Pre-trained AI Models for Downstream Developers

Peerachai Banyongrakkul, Mansooreh Zahedi, Patanamon Thongtanunam et al.

Pre-trained models (PTMs) have gained widespread popularity and achieved remarkable success across various fields, driven by their groundbreaking performance and easy accessibility through hosting providers. However, the challenges faced by downstream developers in reusing PTMs in software systems are less explored. To bridge this knowledge gap, we qualitatively created and analyzed a dataset of 840 PTM-related issue reports from 31 OSS GitHub projects. We systematically developed a comprehensive taxonomy of PTM-related challenges that developers face in downstream projects. Our study identifies seven key categories of challenges that downstream developers face in reusing PTMs, such as model usage, model performance, and output quality. We also compared our findings with existing taxonomies. Additionally, we conducted a resolution time analysis and, based on statistical tests, found that PTM-related issues take significantly longer to be resolved than issues unrelated to PTMs, with significant variation across challenge categories. We discuss the implications of our findings for practitioners and possibilities for future research.

SEMay 8Code

A Dataset of Agentic AI Coding Tool Configurations

Matthias Galster, Seyedmoein Mohsenimofidi, Levi Böhme et al.

Agentic AI coding tools such as Claude Code and OpenAI Codex execute multi-step coding tasks with limited human oversight. To steer these tools, developers create repository-level configuration artifacts (e.g., Markdown files) for configuration mechanisms such as Context Files, Skills, Rules, and Hooks. There is no curated dataset yet that captures these configurations at scale. This dataset, collected from open-source GitHub repositories, fills that gap. We selected 40,585 actively maintained repositories through metadata filtering, classified them using GPT-5.2 to identify 36,710 as belonging to engineered software projects, and systematically detected configuration artifacts in these repositories. The dataset covers 4,738 repositories across five tools (Claude Code, GitHub Copilot, OpenAI Codex, Cursor, Gemini) and eight configuration mechanisms. We collected 15,591 configuration artifacts, the full content of 18,167 configuration files associated with these configuration artifacts, and 148,519 AI-co-authored commits. The dataset and the construction pipeline are publicly available on Zenodo under CC BY 4.0. An interactive website allows researchers to browse and explore the data. This data supports research on context engineering, AI tool adoption patterns, and human-AI collaboration.

SEJan 26

Rethinking Artifact Evaluation for Software Engineering in the Age of Generative AI

Christoph Treude, Christopher M. Poskitt, Rashina Hoda

Peer review in software engineering research operates under tight time constraints, while generative AI has substantially reduced the human effort required to produce polished research narratives. Reviewer attention is often spent on aspects of submissions such as writing quality or literature positioning that have become relatively less effort-intensive to address, rather than on evaluating the scientific substance of a paper. At the same time, assessing whether methods are implemented correctly, analyses are sound, and claims are supported by evidence remains effort-intensive and dependent on human expertise. In software engineering research, this substance is frequently embodied in artifacts, including code, data, evidence and analysis samples, and experimental infrastructure. In this position paper, we argue that artifact evaluation should be treated as a first-class component of peer review. We frame peer review as an attention allocation problem, examine how generative AI weakens narrative quality as a signal of rigor, and argue that artifact evaluation should play a more prominent role in peer review decisions.

SENov 9, 2025

Walking the Tightrope of LLMs for Software Development: A Practitioners' Perspective

Samuel Ferino, Rashina Hoda, John Grundy et al.

Background: Large Language Models emerged with the potential of provoking a revolution in software development (e.g., automating processes, workforce transformation). Although studies have started to investigate the perceived impact of LLMs for software development, there is a need for empirical studies to comprehend how to balance forward and backward effects of using LLMs. Objective: We investigated how LLMs impact software development and how to manage the impact from a software developer's perspective. Method: We conducted 22 interviews with software practitioners across 3 rounds of data collection and analysis, between October (2024) and September (2025). We employed socio-technical grounded theory (STGT) for data analysis to rigorously analyse interview participants' responses. Results: We identified the benefits (e.g., maintain software development flow, improve developers' mental model, and foster entrepreneurship) and disadvantages (e.g., negative impact on developers' personality and damage to developers' reputation) of using LLMs at individual, team, organisation, and society levels; as well as best practices on how to adopt LLMs. Conclusion: Critically, we present the trade-offs that software practitioners, teams, and organisations face in working with LLMs. Our findings are particularly useful for software team leaders and IT managers to assess the viability of LLMs within their specific context.

SEApr 8

Fine-grained Approaches for Confidence Calibration of LLMs in Automated Code Revision

Hong Yi Lin, Chunhua Liu, Haoyu Gao et al.

In today's AI-assisted software engineering landscape, developers increasingly depend on LLMs that are highly capable, yet inherently imperfect. The tendency of these models to produce incorrect outputs can reduce developer productivity. To this end, a canonical mitigation method is to provide calibrated confidence scores that faithfully reflect their likelihood of correctness at the instance-level. Such information allows users to make immediate decisions regarding output acceptance, abstain error-prone outputs, and better align their expectations with the model's capabilities. Since post-trained LLMs do not inherently produce well-calibrated confidence scores, researchers have developed post-hoc calibration methods, with global Platt-scaling of sequence-level confidence scores proving effective in many generative software engineering tasks but remaining unreliable or unexplored for automated code revision (ACR) tasks such as program repair, vulnerability repair, and code refinement. We hypothesise that the coarse-grained nature of this conventional method makes it ill-suited for ACR tasks, where correctness is often determined by local edit decisions and miscalibration can be sample-dependent, thereby motivating fine-grained confidence calibration. To address this, our study proposes local Platt-scaling applied separately to three different fine-grained confidence scores. Through experiments across 3 separate tasks and correctness metrics, as well as 14 different models of various sizes, we find that fine-grained confidence scores consistently achieve lower calibration error across a broader range of probability intervals, and this effect is further amplified when global Platt-scaling is applied. Our proposed approaches offer a practical solution to eliciting well-calibrated confidence scores, enabling more trustworthy and streamlined usage of imperfect models in ACR tasks.

SEApr 16

An Empirical Study of API Misuses of Data-Centric Libraries

Akalanka Galappaththi, Sarah Nadi, Christoph Treude

Developers rely on third-party library Application Programming Interfaces (APIs) when developing software. However, libraries typically come with assumptions and API usage constraints, whose violation results in API misuse. API misuses may result in crashes or incorrect behavior. Even though API misuse is a well-studied area, a recent study of API misuse of deep learning libraries showed that the nature of these misuses and their symptoms are different from misuses of traditional libraries, and as a result highlighted potential shortcomings of current misuse detection tools. We speculate that these observations may not be limited to deep learning API misuses but may stem from the data-centric nature of these APIs. Data-centric libraries often deal with diverse data structures, intricate processing workflows, and a multitude of parameters, which can make them inherently more challenging to use correctly. Therefore, understanding the potential misuses of these libraries is important to avoid unexpected application behavior. To this end, this paper contributes an empirical study of API misuses of five data-centric libraries that cover areas such as data processing, numerical computation, machine learning, and visualization. We identify misuses of these libraries by analyzing data from both Stack Overflow and GitHub. Our results show that many of the characteristics of API misuses observed for deep learning libraries extend to misuses of the data-centric library APIs we study. We also find that developers tend to misuse APIs from data-centric libraries, regardless of whether the API directive appears in the documentation. Overall, our work exposes the challenges of API misuse in data-centric libraries, rather than only focusing on deep learning libraries. Our collected misuses and their characterization lay groundwork for future research to help reduce misuses of these libraries.

SEMay 6

Accountable Agents in Software Engineering: An Analysis of Terms of Service and a Research Roadmap

Christoph Treude

AI coding assistants and autonomous agents are becoming integral to software development workflows, reshaping how code is produced, reviewed, and maintained. While recent research has focused mainly on the capabilities and impacts of productivity of these systems, much less attention has been paid to accountability: who is responsible when agents generate, modify, or recommend code? In practice, accountability is defined through the Terms of Service (ToS) and related policy documents that govern the use of AI-powered development tools. In this vision paper, we present a comparative analysis of the Terms of Service for widely used AI coding assistants and agent-enabled development tools. We examine how these documents allocate ownership, responsibility, liability, and disclosure obligations between tool providers and software developers, and we identify common patterns and divergences between providers. Our analysis reveals a consistent tendency to shift responsibility for correctness, safety, and legal compliance onto users, as well as substantial variation in how providers address issues such as indemnification, data reuse, and acceptable use. Based on these findings, we argue that existing policy frameworks are poorly aligned with increasingly agent-mediated and autonomous software development workflows. We outline a research roadmap for accountable agents in software engineering, identifying challenges and opportunities for modeling responsibility, designing governance artifacts, developing tooling that supports accountability, and conducting empirical studies of developers' perceptions and practices.

SEMar 20

Configuring Agentic AI Coding Tools: An Exploratory Study

Matthias Galster, Seyedmoein Mohsenimofidi, Jai Lal Lulla et al.

Agentic AI coding tools increasingly automate software development tasks. Developers can configure these tools through versioned repository-level artifacts such as Markdown and JSON files. We present a systematic analysis of configuration mechanisms for agentic AI coding tools, covering Claude Code, GitHub Copilot, Cursor, Gemini, and Codex. We identify eight configuration mechanisms spanning a spectrum from static context to executable and external integrations, and, in an empirical study of 2,923 GitHub repositories, examine whether and how they are adopted, with a detailed analysis of Context Files, Skills, and Subagents. First, Context Files dominate the configuration landscape and are often the sole mechanism in a repository, with AGENTS$.$md emerging as an interoperable standard across tools. Second, advanced mechanisms such as Skills and Subagents are only shallowly adopted. Most repositories define only one or two artifacts, and Skills predominantly rely on static instructions rather than executable workflows. Third, distinct configuration cultures are forming around different tools, with Claude Code users employing the broadest range of mechanisms. These findings establish an empirical baseline for understanding how developers configure agentic tools, suggest that AGENTS$.$md serves as a natural starting point, and motivate longitudinal and experimental research on how configuration strategies evolve and affect agent performance.

SEMar 10, 2025Code

Novice Developers' Perspectives on Adopting LLMs for Software Development: A Systematic Literature Review

Samuel Ferino, Rashina Hoda, John Grundy et al.

Following the rise of large language models (LLMs), many studies have emerged in recent years focusing on exploring the adoption of LLM-based tools for software development by novice developers: computer science/software engineering students and early-career industry developers with two years or less of professional experience. These studies have sought to understand the perspectives of novice developers on using these tools, a critical aspect of the successful adoption of LLMs in software engineering. To systematically collect and summarise these studies, we conducted a systematic literature review (SLR) following the guidelines by Kitchenham et al. on 80 primary studies published between April 2022 and June 2025 to answer four research questions (RQs). In answering RQ1, we categorised the study motivations and methodological approaches. In RQ2, we identified the software development tasks for which novice developers use LLMs. In RQ3, we categorised the advantages, challenges, and recommendations discussed in the studies. Finally, we discuss the study limitations and future research needs suggested in the primary studies in answering RQ4. Throughout the paper, we also indicate directions for future work and implications for software engineering researchers, educators, and developers. Our research artifacts are publicly available at https://github.com/Samuellucas97/SupplementaryInfoPackage-SLR.

SEJan 28

On the Impact of AGENTS.md Files on the Efficiency of AI Coding Agents

Jai Lal Lulla, Seyedmoein Mohsenimofidi, Matthias Galster et al.

AI coding agents such as Codex and Claude Code are increasingly used to autonomously contribute to software repositories. However, little is known about how repository-level configuration artifacts affect operational efficiency of the agents. In this paper, we study the impact of AGENTS$.$md files on the runtime and token consumption of AI coding agents operating on GitHub pull requests. We analyze 10 repositories and 124 pull requests, executing agents under two conditions: with and without an AGENTS$.$md file. We measure wall-clock execution time and token usage during agent execution. Our results show that the presence of AGENTS$.$md is associated with a lower median runtime ($Δ28.64$%) and reduced output token consumption ($Δ16.58$%), while maintaining a comparable task completion behavior. Based on these results, we discuss immediate implications for the configuration and deployment of AI coding agents in practice, and outline a broader research agenda on the role of repository-level instructions in shaping the behavior, efficiency, and integration of AI coding agents in software development workflows.

SEApr 17

AI Slop and the Software Commons

Sebastian Baltes, Marc Cheong, Christoph Treude

In this article, we argue that AI slop in software is creating a tragedy of the commons. Individual productivity gains from AI-generated content externalize costs onto reviewer capacity, codebase integrity, public knowledge resources, collaborative trust, and the talent pipeline. AI slop is cheap to generate and expensive to review, and the review layer is already thin. Commons problems are not solved by individual restraint. We outline concrete next steps for tool developers, team leads, and educators, grounded in Ostrom's design principles for enduring commons institutions.

SEFeb 11, 2022Code

GitHub Sponsors: Exploring a New Way to Contribute to Open Source

Naomichi Shimada, Tao Xiao, Hideaki Hata et al.

GitHub Sponsors, launched in 2019, enables donations to individual open source software (OSS) developers. Financial support for OSS maintainers and developers is a major issue in terms of sustaining OSS projects, and the ability to donate to individuals is expected to support the sustainability of developers, projects, and community. In this work, we conducted a mixed-methods study of GitHub Sponsors, including quantitative and qualitative analyses, to understand the characteristics of developers who are likely to receive donations and what developers think about donations to individuals. We found that: (1) sponsored developers are more active than non-sponsored developers, (2) the possibility to receive donations is related to whether there is someone in their community who is donating, and (3) developers are sponsoring as a new way to contribute to OSS. Our findings are the first step towards data-informed guidance for using GitHub Sponsors, opening up avenues for future work on this new way of financially sustaining the OSS community.

SEMar 1, 2021Code

How Developers Engineer Test Cases: An Observational Study

Maurício Aniche, Christoph Treude, Andy Zaidman

One of the main challenges that developers face when testing their systems lies in engineering test cases that are good enough to reveal bugs. And while our body of knowledge on software testing and automated test case generation is already quite significant, in practice, developers are still the ones responsible for engineering test cases manually. Therefore, understanding the developers' thought- and decision-making processes while engineering test cases is a fundamental step in making developers better at testing software. In this paper, we observe 13 developers thinking-aloud while testing different real-world open-source methods, and use these observations to explain how developers engineer test cases. We then challenge and augment our main findings by surveying 72 software developers on their testing practices. We discuss our results from three different angles. First, we propose a general framework that explains how developers reason about testing. Second, we propose and describe in detail the three different overarching strategies that developers apply when testing. Third, we compare and relate our observations with the existing body of knowledge and propose future studies that would advance our knowledge on the topic.

SEFeb 10, 2021Code

GitHub Discussions: An Exploratory Study of Early Adoption

Hideaki Hata, Nicole Novielli, Sebastian Baltes et al.

Discussions is a new feature of GitHub for asking questions or discussing topics outside of specific Issues or Pull Requests. Before being available to all projects in December 2020, it had been tested on selected open source software projects. To understand how developers use this novel feature, how they perceive it, and how it impacts the development processes, we conducted a mixed-methods study based on early adopters of GitHub discussions from January until July 2020. We found that: (1) errors, unexpected behavior, and code reviews are prevalent discussion categories; (2) there is a positive relationship between project member involvement and discussion frequency; (3) developers consider GitHub Discussions useful but face the problem of topic duplication between Discussions and Issues; (4) Discussions play a crucial role in advancing the development of projects; and (5) positive sentiment in Discussions is more frequent than in Stack Overflow posts. Our findings are a first step towards data-informed guidance for using GitHub Discussions, opening up avenues for future work on this novel communication channel.

SEJan 25, 2021Code

The Shifting Sands of Motivation: Revisiting What Drives Contributors in Open Source

Marco Gerosa, Igor Wiese, Bianca Trinkenreich et al.

Open Source Software (OSS) has changed drastically over the last decade, with OSS projects now producing a large ecosystem of popular products, involving industry participation, and providing professional career opportunities. But our field's understanding of what motivates people to contribute to OSS is still fundamentally grounded in studies from the early 2000s. With the changed landscape of OSS, it is very likely that motivations to join OSS have also evolved. Through a survey of 242 OSS contributors, we investigate shifts in motivation from three perspectives: (1) the impact of the new OSS landscape, (2) the impact of individuals' personal growth as they become part of OSS communities, and (3) the impact of differences in individuals' demographics. Our results show that some motivations related to social aspects and reputation increased in frequency and that some intrinsic and internalized motivations, such as learning and intellectual stimulation, are still highly relevant. We also found that contributing to OSS often transforms extrinsic motivations to intrinsic, and that while experienced contributors often shift toward altruism, novices often shift toward career, fun, kinship, and learning. OSS projects can leverage our results to revisit current strategies to attract and retain contributors, and researchers and tool builders can better support the design of new studies and tools to engage and support OSS development.

SEDec 7, 2020Code

How Successful Are Open Source Contributions From Countries with Different Levels of Human Development?

Leonardo Furtado, Bruno Cartaxo, Christoph Treude et al.

Are Brazilian developers less likely to have a contribution accepted than their peers from, say, the United Kingdom? In this paper we studied whether the developers' location relates to the outcome of a pull request. We curated the locations of 14k contributors who performed 44k pull requests to 20 open source projects. Our results indeed suggest that developers from countries with low human development indexes (HDI) not only perform a small fraction of the overall pull requests, but they also are the ones that face rejection the most.

SEApr 1, 2020Code

GitHub Repositories with Links to Academic Papers: Public Access, Traceability, and Evolution

Supatsara Wattanakriengkrai, Bodin Chinthanet, Hideaki Hata et al.

Traceability between published scientific breakthroughs and their implementation is essential, especially in the case of open-source scientific software which implements bleeding-edge science in its code. However, aligning the link between GitHub repositories and academic papers can prove difficult, and the current practice of establishing and maintaining such links remains unknown. This paper investigates the role of academic paper references contained in these repositories. We conduct a large-scale study of 20 thousand GitHub repositories that make references to academic papers. We use a mixed-methods approach to identify public access, traceability and evolutionary aspects of the links. Although referencing a paper is not typical, we find that a vast majority of referenced academic papers are public access. These repositories tend to be affiliated with academic communities. More than half of the papers do not link back to any repository. We find that academic papers from top-tier SE venues are not likely to reference a repository, but when they do, they usually link to a GitHub software repository. In a network of arXiv papers and referenced repositories, we find that the most referenced papers are (i) highly-cited in academia and (ii) are referenced by repositories written in different programming languages.

SEOct 15, 2019Code

From Academia to Software Development: Publication Citations in Source Code Comments

Akira Inokuchi, Yusuf Sulistyo Nugroho, Supatsara Wattanakriengkrai et al.

Academic publications have been evaluated in terms of their impact on research communities based on many metrics, such as the number of citations. On the other hand, the impact of academic publications on industry has been rarely studied. This paper investigates how academic publications contribute to software development by analyzing publication citations in source code comments in open source software repositories. We propose an automated approach for detecting academic publications based on Named Entity Recognition, and achieve 0.90 in $F_1$ as detection accuracy. We conduct a large-scale study of publication citations with 319,438,977 comments collected from 25,925 active repositories written in seven programming languages. Our findings indicate that academic publications can be knowledge sources for software development. These referenced publications are particularly from journals. In terms of knowledge transfer, algorithm is the most prevalent type of knowledge transferred from the publications, with proposed formulas or equations typically implemented in methods or functions in source code files. In a closer look at GitHub repositories referencing academic publications, we find that science-related repositories are the most frequent among GitHub repositories with publication citations, and that the vast majority of these publications are referenced by repository owners who are different from the publication authors. We also find that referencing older publications can lead to potential issues related to obsolete knowledge.

SEOct 13, 2019Code

Google Summer of Code: Student Motivations and Contributions

Jefferson O. Silva, Igor Wiese, Daniel M. German et al.

Several open source software (OSS) projects expect to foster newcomers' onboarding and to receive contributions by participating in engagement programs, like Summers of Code. However, there is little empirical evidence showing why students join such programs. In this paper, we study the well-established Google Summer of Code (GSoC), which is a 3-month OSS engagement program that offers stipends and mentors to students willing to contribute to OSS projects. We combined a survey (students and mentors) and interviews (students) to understand what motivates students to enter GSoC. Our results show that students enter GSoC for an enriching experience, not necessarily to become frequent contributors. Our data suggest that, while the stipends are an important motivator, the students participate for work experience and the ability to attach the name of the supporting organization to their resumés. We also discuss practical implications for students, mentors, OSS projects, and Summer of Code programs.

SEJan 22, 2019Code

9.6 Million Links in Source Code Comments: Purpose, Evolution, and Decay

Hideaki Hata, Christoph Treude, Raula Gaikovina Kula et al.

Links are an essential feature of the World Wide Web, and source code repositories are no exception. However, despite their many undisputed benefits, links can suffer from decay, insufficient versioning, and lack of bidirectional traceability. In this paper, we investigate the role of links contained in source code comments from these perspectives. We conducted a large-scale study of around 9.6 million links to establish their prevalence, and we used a mixed-methods approach to identify the links' targets, purposes, decay, and evolutionary aspects. We found that links are prevalent in source code repositories, that licenses, software homepages, and specifications are common types of link targets, and that links are often included to provide metadata or attribution. Links are rarely updated, but many link targets evolve. Almost 10% of the links included in source code comments are dead. We then submitted a batch of link-fixing pull requests to open source software repositories, resulting in most of our fixes being merged successfully. Our findings indicate that links in source code comments can indeed be fragile, and our work opens up avenues for future work to address these problems.

SEMar 28

"An Endless Stream of AI Slop": The Growing Burden of AI-Assisted Software Development

Sebastian Baltes, Marc Cheong, Christoph Treude

"AI slop", that is, low-quality AI-generated content, is increasingly affecting software development, from generated code and pull requests to documentation and bug reports. However, there is limited empirical research on how developers perceive and respond to this phenomenon. We conducted a qualitative analysis of 1,154 posts across 15 discussion threads from Reddit and Hacker News, developing a codebook of 15 codes organized into three thematic clusters: Review Friction (how AI slop burdens reviewers, erodes trust, and prompts countermeasures), Quality Degradation (damage to codebases, knowledge resources, and developer competence), and Forces and Consequences (systemic incentives, mandated adoption, craft erosion, and workforce disruption). Our findings frame AI slop as a tragedy of the commons, where individual productivity gains externalize costs onto reviewers, maintainers, and the broader community. We report the concerns developers raise and the mitigation strategies they propose, offering actionable insights for tool developers, team leads, and educators.

SEJan 15, 2025

How Developers Interact with AI: A Taxonomy of Human-AI Collaboration in Software Engineering

Christoph Treude, Marco A. Gerosa

Artificial intelligence (AI), including large language models and generative AI, is emerging as a significant force in software development, offering developers powerful tools that span the entire development lifecycle. Although software engineering research has extensively studied AI tools in software development, the specific types of interactions between developers and these AI-powered tools have only recently begun to receive attention. Understanding and improving these interactions has the potential to enhance productivity, trust, and efficiency in AI-driven workflows. In this paper, we propose a taxonomy of interaction types between developers and AI tools, identifying eleven distinct interaction types, such as auto-complete code suggestions, command-driven actions, and conversational assistance. Building on this taxonomy, we outline a research agenda focused on optimizing AI interactions, improving developer control, and addressing trust and usability challenges in AI-assisted development. By establishing a structured foundation for studying developer-AI interactions, this paper aims to stimulate research on creating more effective, adaptive AI tools for software development.

SEFeb 12, 2025

Generative AI and Empirical Software Engineering: A Paradigm Shift

Christoph Treude, Margaret-Anne Storey

The adoption of large language models (LLMs) and autonomous agents in software engineering marks an enduring paradigm shift. These systems create new opportunities for tool design, workflow orchestration, and empirical observation, while fundamentally reshaping the roles of developers and the artifacts they produce. Although traditional empirical methods remain central to software engineering research, the rapid evolution of AI introduces new data modalities, alters causal assumptions, and challenges foundational constructs such as "developer", "artifact", and "interaction". As humans and AI agents increasingly co-create, the boundaries between social and technical actors blur, and the reproducibility of findings becomes contingent on model updates and prompt contexts. This vision paper examines how the integration of LLMs into software engineering disrupts established research paradigms. We discuss how it transforms the phenomena we study, the methods and theories we rely on, the data we analyze, and the threats to validity that arise in dynamic AI-mediated environments. Our aim is to help the empirical software engineering community adapt its questions, instruments, and validation standards to a future in which AI systems are not merely tools, but active collaborators shaping software engineering and its study.

SEMar 1, 2025

Interacting with AI Reasoning Models: Harnessing "Thoughts" for AI-Driven Software Engineering

Christoph Treude, Raula Gaikovina Kula

Recent advances in AI reasoning models provide unprecedented transparency into their decision-making processes, transforming them from traditional black-box systems into models that articulate step-by-step chains of thought rather than producing opaque outputs. This shift has the potential to improve software quality, explainability, and trust in AI-augmented development. However, software engineers rarely have the time or cognitive bandwidth to analyze, verify, and interpret every AI-generated thought in detail. Without an effective interface, this transparency could become a burden rather than a benefit. In this paper, we propose a vision for structuring the interaction between AI reasoning models and software engineers to maximize trust, efficiency, and decision-making power. We argue that simply exposing AI's reasoning is not enough -- software engineers need tools and frameworks that selectively highlight critical insights, filter out noise, and facilitate rapid validation of key assumptions. To illustrate this challenge, we present motivating examples in which AI reasoning models state their assumptions when deciding which external library to use and produce divergent reasoning paths and recommendations about security vulnerabilities, highlighting the need for an interface that prioritizes actionable insights while managing uncertainty and resolving conflicts. We then outline a research roadmap for integrating automated summarization, assumption validation, and multi-model conflict resolution into software engineering workflows. Achieving this vision will unlock the full potential of AI reasoning models to enable software engineers to make faster, more informed decisions without being overwhelmed by unnecessary detail.

SEMar 12, 2025

Enhancing High-Quality Code Generation in Large Language Models with Comparative Prefix-Tuning

Yuan Jiang, Yujian Zhang, Liang Lu et al.

Large Language Models (LLMs) have been widely adopted in commercial code completion engines, significantly enhancing coding efficiency and productivity. However, LLMs may generate code with quality issues that violate coding standards and best practices, such as poor code style and maintainability, even when the code is functionally correct. This necessitates additional effort from developers to improve the code, potentially negating the efficiency gains provided by LLMs. To address this problem, we propose a novel comparative prefix-tuning method for controllable high-quality code generation. Our method introduces a single, property-specific prefix that is prepended to the activations of the LLM, serving as a lightweight alternative to fine-tuning. Unlike existing methods that require training multiple prefixes, our approach trains only one prefix and leverages pairs of high-quality and low-quality code samples, introducing a sequence-level ranking loss to guide the model's training. This comparative approach enables the model to better understand the differences between high-quality and low-quality code, focusing on aspects that impact code quality. Additionally, we design a data construction pipeline to collect and annotate pairs of high-quality and low-quality code, facilitating effective training. Extensive experiments on the Code Llama 7B model demonstrate that our method improves code quality by over 100% in certain task categories, while maintaining functional correctness. We also conduct ablation studies and generalization experiments, confirming the effectiveness of our method's components and its strong generalization capability.

SEMar 20, 2025

CodeReviewQA: The Code Review Comprehension Assessment for Large Language Models

Hong Yi Lin, Chunhua Liu, Haoyu Gao et al.

State-of-the-art large language models (LLMs) have demonstrated impressive code generation capabilities but struggle with real-world software engineering tasks, such as revising source code to address code reviews, hindering their practical use. Code review comments are often implicit, ambiguous, and colloquial, requiring models to grasp both code and human intent. This challenge calls for evaluating large language models' ability to bridge both technical and conversational contexts. While existing work has employed the automated code refinement (ACR) task to resolve these comments, current evaluation methods fall short, relying on text matching metrics that provide limited insight into model failures and remain susceptible to training data contamination. To address these limitations, we introduce a novel evaluation benchmark, $\textbf{CodeReviewQA}$ that enables us to conduct fine-grained assessment of model capabilities and mitigate data contamination risks. In CodeReviewQA, we decompose the generation task of code refinement into $\textbf{three essential reasoning steps}$: $\textit{change type recognition}$ (CTR), $\textit{change localisation}$ (CL), and $\textit{solution identification}$ (SI). Each step is reformulated as multiple-choice questions with varied difficulty levels, enabling precise assessment of model capabilities, while mitigating data contamination risks. Our comprehensive evaluation spans 72 recently released large language models on $\textbf{900 manually curated, high-quality examples}$ across nine programming languages. Our results show that CodeReviewQA is able to expose specific model weaknesses in code review comprehension, disentangled from their generative automated code refinement results.

SENov 19, 2025

Effective Code Membership Inference for Code Completion Models via Adversarial Prompts

Yuan Jiang, Zehao Li, Shan Huang et al.

Membership inference attacks (MIAs) on code completion models offer an effective way to assess privacy risks by inferring whether a given code snippet was part of the training data. Existing black- and gray-box MIAs rely on expensive surrogate models or manually crafted heuristic rules, which limit their ability to capture the nuanced memorization patterns exhibited by over-parameterized code language models. To address these challenges, we propose AdvPrompt-MIA, a method specifically designed for code completion models, combining code-specific adversarial perturbations with deep learning. The core novelty of our method lies in designing a series of adversarial prompts that induce variations in the victim code model's output. By comparing these outputs with the ground-truth completion, we construct feature vectors to train a classifier that automatically distinguishes member from non-member samples. This design allows our method to capture richer memorization patterns and accurately infer training set membership. We conduct comprehensive evaluations on widely adopted models, such as Code Llama 7B, over the APPS and HumanEval benchmarks. The results show that our approach consistently outperforms state-of-the-art baselines, with AUC gains of up to 102%. In addition, our method exhibits strong transferability across different models and datasets, underscoring its practical utility and generalizability.

SEOct 14, 2025

Enhancing Neural Code Representation with Additional Context

Huy Nguyen, Christoph Treude, Patanamon Thongtanunam

Automated program comprehension underpins many software engineering tasks, from code summarisation to clone detection. Recent deep learning models achieve strong results but typically rely on source code alone, overlooking contextual information such as version history or structural relationships. This limits their ability to capture how code evolves and operates. We conduct an empirical study on how enriching code representations with such contextual signals affects neural model performance on key comprehension tasks. Two downstream tasks, code clone detection and code summarisation, are evaluated using SeSaMe (1,679 Java methods) and CodeSearchNet (63,259 methods). Five representative models (CodeBERT, GraphCodeBERT, CodeT5, PLBART, ASTNN) are fine-tuned under code-only and context-augmented settings. Results show that context generally improves performance: version history consistently boosts clone detection (e.g., CodeT5 +15.92% F1) and summarisation (e.g., GraphCodeBERT +5.56% METEOR), while call-graph effects vary by model and task. Combining multiple contexts yields further gains (up to +21.48% macro-F1). Human evaluation on 100 Java snippets confirms that context-augmented summaries are significantly preferred for Accuracy and Content Adequacy (p <= 0.026; |delta| up to 0.55). These findings highlight the potential of contextual signals to enhance code comprehension and open new directions for optimising contextual encoding in neural SE models.

SEJan 14, 2022

Software Engineering User Study Recruitment on Prolific: An Experience Report

Brittany Reid, Markus Wagner, Marcelo d'Amorim et al.

Online participant recruitment platforms such as Prolific have been gaining popularity in research, as they enable researchers to easily access large pools of participants. However, participant quality can be an issue; participants may give incorrect information to gain access to more studies, adding unwanted noise to results. This paper details our experience recruiting participants from Prolific for a user study requiring programming skills in Node.js, with the aim of helping other researchers conduct similar studies. We explore a method of recruiting programmer participants using prescreening validation, attention checks and a series of programming knowledge questions. We received 680 responses, and determined that 55 met the criteria to be invited to our user study. We ultimately conducted user study sessions via video calls with 10 participants. We conclude this paper with a series of recommendations for researchers.

SEOct 25, 2021

Generating GitHub Repository Descriptions: A Comparison of Manual and Automated Approaches

Jazlyn Hellman, Eunbee Jang, Christoph Treude et al.

Given the vast number of repositories hosted on GitHub, project discovery and retrieval have become increasingly important for GitHub users. Repository descriptions serve as one of the first points of contact for users who are accessing a repository. However, repository owners often fail to provide a high-quality description; instead, they use vague terms, the purpose of the repository is poorly explained, or the description is omitted entirely. In this work, we examine the current practice of writing GitHub repository descriptions. Our investigation leads to the proposal of the LSP (Language, Software technology, and Purpose) template to formulate good descriptions for GitHub repositories that are clear, concise, and informative. To understand the extent to which current automated techniques can support generating repository descriptions, we compare the performance of state-of-the-art text summarization methods on this task. Finally, our user study with GitHub users reveals that automated summarization can adequately be used for default description generation for GitHub repositories, while the descriptions which follow the LSP template offer the most effective instrument for communicating with GitHub users.

SEAug 13, 2021

Contrasting Third-Party Package Management User Experience

Syful Islam, Raula Gaikovina Kula, Christoph Treude et al.

The management of third-party package dependencies is crucial to most technology stacks, with package managers acting as brokers to ensure that a verified package is correctly installed, configured, or removed from an application. Diversity in technology stacks has led to dozens of package ecosystems with their own management features. While recent studies have shown that developers struggle to migrate their dependencies, the common assumption is that package ecosystems are used without any issue. In this study, we explore 13 package ecosystems to understand whether their features correlate with the experience of their users. By studying experience through the questions that developers ask on the question-and-answer site Stack Overflow, we find that developer questions are grouped into three themes (i.e., Package management, Input-Output, and Package Usage). Our preliminary analysis indicates that specific features are correlated with the user experience. Our work lays out future directions to investigate the trade-offs involved in designing the ideal package ecosystem.

SEJul 29, 2021

An Empirical Study of Developers' Discussions about Security Challenges of Different Programming Languages

Roland Croft, Yongzheng Xie, Mansooreh Zahedi et al.

Given programming languages can provide different types and levels of security support, it is critically important to consider security aspects while selecting programming languages for developing software systems. Inadequate consideration of security in the choice of a programming language may lead to potential ramifications for secure development. Whilst theoretical analysis of the supposed security properties of different programming languages has been conducted, there has been relatively little effort to empirically explore the actual security challenges experienced by developers. We have performed a large-scale study of the security challenges of 15 programming languages by quantitatively and qualitatively analysing the developers' discussions from Stack Overflow and GitHub. By leveraging topic modelling, we have derived a taxonomy of 18 major security challenges for 6 topic categories. We have also conducted comparative analysis to understand how the identified challenges vary regarding the different programming languages and data sources. Our findings suggest that the challenges and their characteristics differ substantially for different programming languages and data sources, i.e., Stack Overflow and GitHub. The findings provide evidence-based insights and understanding of security challenges related to different programming languages to software professionals (i.e., practitioners or researchers). The reported taxonomy of security challenges can assist both practitioners and researchers in better understanding and traversing the secure development landscape. This study highlights the importance of the choice of technology, e.g., programming language, in secure software engineering. Hence, the findings are expected to motivate practitioners to consider the potential impact of the choice of programming languages on software security.

SEJun 23, 2021

What makes a good Node.js package? Investigating Users, Contributors, and Runnability

Bodin Chinthanet, Brittany Reid, Christoph Treude et al.

The Node.js Package Manager (i.e., npm) archive repository serves as a critical part of the JavaScript community and helps support one of the largest developer ecosystems in the world. However, as a developer, selecting an appropriate npm package to use or contribute to can be difficult. To understand what features users and contributors consider important when searching for a good npm package, we conduct a survey asking Node.js developers to evaluate the importance of 30 features derived from existing work, including GitHub activity, software usability, and properties of the repository and documentation. We identify that both user and contributor perspectives share similar views on which features they use to assess package quality. We then extract the 30 features from 104,364 npm packages and analyse the correlations between them, including three software features that measure package ``runnability"; ability to install, build, and execute a unit test. We identify which features are negatively correlated with runnability-related features and find that predicting the runnability of packages is viable. Our study lays the groundwork for future work on understanding how users and contributors select appropriate npm packages.

SEApr 19, 2021

What's behind tight deadlines? Business causes of technical debt

Rodrigo Rebouças de Almeida, Christoph Treude, Uirá Kulesza

What are the business causes behind tight deadlines? What drives the prioritization of features that pushes quality matters to the back burner? We conducted a survey with 71 experienced practitioners and did a thematic analysis of the open-ended answers to the question: ``Could you give examples of how business may contribute to technical debt?'' Business-related causes were organized into two categories: pure-business and business/IT gap, and they were related to `tight deadlines' and `features over quality', the most frequently cited management reasons for technical debt. We contribute a cause-effect model which relates the various business causes of tight deadlines and the behavior of prioritizing features over quality aspects.