Thomas Zimmermann

SE
h-index: 12
26 papers
1,013 citations
Novelty: 25%
AI Score: 45

26 Papers

SE · Jan 10, 2023
Recommending Root-Cause and Mitigation Steps for Cloud Incidents using Large Language Models

Toufique Ahmed, Supriyo Ghosh, Chetan Bansal et al. · cmu, ibm-research

Incident management for cloud services is a complex process involving several steps and has a huge impact on both service health and developer productivity. On-call engineers require a significant amount of domain knowledge and manual effort for root causing and mitigation of production incidents. Recent advances in artificial intelligence have resulted in state-of-the-art large language models like GPT-3.x (both GPT-3.0 and GPT-3.5), which have been used to solve a variety of problems ranging from question answering to text summarization. In this work, we do the first large-scale study to evaluate the effectiveness of these models for helping engineers root cause and mitigate production incidents. We do a rigorous study at Microsoft, on more than 40,000 incidents and compare several large language models in zero-shot, fine-tuned and multi-task settings using semantic and lexical metrics. Lastly, our human evaluation with actual incident owners shows the efficacy and future potential of using artificial intelligence for resolving cloud incidents.

SE · Oct 3, 2023
Can GPT-4 Replicate Empirical Software Engineering Research?

Jenny T. Liang, Carmen Badea, Christian Bird et al. · cmu

Empirical software engineering research on production systems has brought forth a better understanding of the software engineering process for practitioners and researchers alike. However, only a small subset of production systems is studied, limiting the impact of this research. While software engineering practitioners could benefit from replicating research on their own data, this poses its own set of challenges, since performing replications requires a deep understanding of research methodologies and subtle nuances in software engineering data. Given that large language models (LLMs), such as GPT-4, show promise in tackling both software engineering- and science-related tasks, these models could help replicate and thus democratize empirical software engineering research. In this paper, we examine GPT-4's abilities to perform replications of empirical software engineering research on new data. We study its ability to surface assumptions made in empirical software engineering research methodologies, as well as its ability to plan and generate code for analysis pipelines on seven empirical software engineering papers. We perform a user study with 14 participants with software engineering research expertise, who evaluate GPT-4-generated assumptions and analysis plans (i.e., a list of module specifications) from the papers. We find that GPT-4 is able to surface correct assumptions, but struggles to generate ones that apply common knowledge about software engineering data. In a manual analysis of the generated code, we find that the GPT-4-generated code contains correct high-level logic, given a subset of the methodology. However, the code contains many small implementation-level errors, reflecting a lack of software engineering knowledge. Our findings have implications for leveraging LLMs for software engineering research as well as practitioner data scientists in software teams.

SE · Mar 24
SE Journals in 2036: Looking Back at the Future We Need to Have

Tim Menzies, Paris Avgeriou, Robert Feldt et al.

In 2025, SE publishing faces an existential crisis of scalability. As our communities swell globally and integrate fast-moving methodologies like LLMs, traditional peer-review practices are collapsing under the strain. The "bureaucratic anomaly" of monolithic review has become mathematically unsustainable, creating a stochastic "lottery" that punishes novelty and exhausts researchers. This paper, written from the perspective of 2036, documents potential solutions. Here, the editors of ASE, EMSE, IST, JSS, TOSEM and TSE dream a collective dream of a brighter future. In summary, first we stopped fighting (The Journal Alliance). Then we fixed the process (The Lottery / Unbundling / Fixing the Benchmark Graveyard). And then we fixed the culture (Cathedrals/Bazaars).

SE · Apr 13
Taking a Pulse on How Generative AI is Reshaping the Software Engineering Research Landscape

Bianca Trinkenreich, Fabio Calefato, Kelly Blincoe et al.

Context: Software engineering (SE) researchers increasingly study Generative AI (GenAI) while also incorporating it into their own research practices. Despite rapid adoption, there is limited empirical evidence on how GenAI is used in SE research and its implications for research practices and governance. Aims: We conduct a large-scale survey of 457 SE researchers publishing in top venues between 2023 and 2025. Method: Using quantitative and qualitative analyses, we examine who uses GenAI and why, where it is used across research activities, and how researchers perceive its benefits, opportunities, challenges, risks, and governance. Results: GenAI use is widespread, with many researchers reporting pressure to adopt and align their work with it. Usage is concentrated in writing and early-stage activities, while methodological and analytical tasks remain largely human-driven. Although productivity gains are widely perceived, concerns about trust, correctness, and regulatory uncertainty persist. Researchers highlight risks such as inaccuracies and bias, emphasize mitigation through human oversight and verification, and call for clearer governance, including guidance on responsible use and peer review. Conclusion: We provide a fine-grained, SE-specific characterization of GenAI use across research activities, along with taxonomies of GenAI use cases for research and peer review, opportunities, risks, mitigation strategies, and governance needs. These findings establish an empirical baseline for the responsible integration of GenAI into academic practice.

SE · Feb 15, 2022 · Code
Attracting and Retaining OSS Contributors with a Maintainer Dashboard

Mariam Guizani, Thomas Zimmermann, Anita Sarma et al.

Tools and artifacts produced by open source software (OSS) have been woven into the foundation of the technology industry. To keep this foundation intact, the open source community needs to actively invest in sustainable approaches to bring in new contributors and nurture existing ones. We take a first step at this by collaboratively designing a maintainer dashboard that provides recommendations on how to attract and retain open source contributors. For example, the dashboard recommends highlighting project goals (e.g., a social good cause) to attract diverse contributors and offers mechanisms to acknowledge existing contributors (e.g., a "rising contributor" badge). Next, we conduct a project-specific evaluation with maintainers to better understand use cases in which this tool will be most helpful at supporting their plans for growth. From analyzing feedback, we find recommendations to be useful at signaling projects as welcoming and providing gentle nudges for maintainers to proactively recognize emerging contributors. However, there are complexities to consider when designing recommendations, such as the project's current development state (e.g., deadlines, milestones, refactoring) and governance model. Finally, we distill our findings to share what the future of recommendations in open source looks like and how to make these recommendations most meaningful over time.

CR · Dec 19, 2021 · Code
What are Weak Links in the npm Supply Chain?

Nusrat Zahan, Thomas Zimmermann, Patrice Godefroid et al.

Modern software development frequently uses third-party packages, raising the concern of supply chain security attacks. Many attackers target popular package managers, like npm, and their users with supply chain attacks. In 2021 there was a 650% year-on-year growth in security attacks exploiting Open Source Software's supply chain. Proactive approaches are needed to predict package vulnerability to high-risk supply chain attacks. The goal of this work is to help software developers and security specialists in measuring npm supply chain weak link signals to prevent future supply chain attacks by empirically studying npm package metadata. In this paper, we analyzed the metadata of 1.63 million JavaScript npm packages. We propose six signals of security weaknesses in a software supply chain, such as the presence of install scripts, maintainer accounts associated with an expired email domain, and inactive packages with inactive maintainers. One of our case studies identified 11 malicious packages from the install scripts signal. We also found 2,818 maintainer email addresses associated with expired domains, allowing an attacker to hijack 8,494 packages by taking over the npm accounts. We obtained feedback on our weak link signals through a survey responded to by 470 npm package developers. The majority of the developers supported three out of our six proposed weak link signals. The developers also indicated that they would want to be notified about weak link signals before using third-party packages. Additionally, we discussed eight new signals suggested by package developers.
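
The "presence of install scripts" signal mentioned in this abstract can be illustrated with a minimal sketch. It assumes the signal is read from the `scripts` field of a package's `package.json`, where npm runs the `preinstall`, `install`, and `postinstall` hooks automatically during installation. The function name and the example manifest are hypothetical and are not taken from the paper.

```python
import json

# npm lifecycle hooks that run automatically during `npm install`.
INSTALL_HOOKS = ("preinstall", "install", "postinstall")


def install_script_signal(package_json_text: str) -> dict:
    """Return the install-time scripts declared in a package.json document.

    Hypothetical helper: an empty result means the signal is absent.
    """
    manifest = json.loads(package_json_text)
    scripts = manifest.get("scripts") or {}
    return {hook: scripts[hook] for hook in INSTALL_HOOKS if hook in scripts}


# Hypothetical manifest used only for illustration.
example = json.dumps({
    "name": "hypothetical-package",
    "version": "1.0.0",
    "scripts": {"postinstall": "node setup.js", "test": "jest"},
})
print(install_script_signal(example))
```

A non-empty result only says that code runs at install time; whether that code is malicious still needs manual or automated review.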

SE · Apr 26, 2021 · Code
Leaving My Fingerprints: Motivations and Challenges of Contributing to OSS for Social Good

Yu Huang, Denae Ford, Thomas Zimmermann

When inspiring software developers to contribute to open source software, the act is often referenced as an opportunity to build tools to support the developer community. However, that is not the only charge that propels contributions -- growing interest in open source has also been attributed to software developers deciding to use their technical skills to benefit a common societal good. To understand how developers identify these projects, their motivations for contributing, and challenges they face, we conducted 21 semi-structured interviews with OSS for Social Good (OSS4SG) contributors. From our interview analysis, we identified themes of contribution styles that we wanted to understand at scale by deploying a survey to over 5765 OSS and Open Source Software for Social Good contributors. From our quantitative analysis of 517 responses, we find that the majority of contributors demonstrate a distinction between OSS4SG and OSS. Likewise, contributors described definitions based on what societal issue the project was to mitigate and who the outcomes of the project were going to benefit. In addition, we find that OSS4SG contributors focus less on benefiting themselves by padding their resume with new technology skills and are more interested in leaving their mark on society at statistically significant levels. We also find that OSS4SG contributors evaluate the owners of the project significantly more than OSS contributors. These findings inform implications to help contributors identify high societal impact projects, help project maintainers reduce barriers to entry, and help organizations understand why contributors are drawn to these projects to sustain active participation.

SE · Mar 5, 2021 · Code
Anomalicious: Automated Detection of Anomalous and Potentially Malicious Commits on GitHub

Danielle Gonzalez, Thomas Zimmermann, Patrice Godefroid et al.

Security is critical to the adoption of open source software (OSS), yet few automated solutions currently exist to help detect and prevent malicious contributions from infecting open source repositories. On GitHub, a primary host of OSS, repositories contain not only code but also a wealth of commit-related and contextual metadata - what if this metadata could be used to automatically identify malicious OSS contributions? In this work, we show how to use only commit logs and repository metadata to automatically detect anomalous and potentially malicious commits. We identify and evaluate several relevant factors which can be automatically computed from this data, such as the modification of sensitive files, outlier change properties, or a lack of trust in the commit's author. Our tool, Anomalicious, automatically computes these factors and considers them holistically using a rule-based decision model. In an evaluation on a data set of 15 malware-infected repositories, Anomalicious showed promising results and identified 53.33% of malicious commits, while flagging less than 1% of commits for most repositories. Additionally, the tool found other interesting anomalies that are not related to malicious commits in an analysis of repositories with no known malicious commits.
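
As a rough illustration of what a rule-based decision over commit factors might look like, the sketch below checks the three factor types named in the abstract (sensitive files, outlier change size, author trust) and flags a commit when enough rules fire. Every name, threshold, and file pattern here is an assumption made for illustration; the abstract does not specify the actual rules or decision model used by Anomalicious.

```python
from dataclasses import dataclass


@dataclass
class Commit:
    author: str
    files: list
    lines_changed: int


# Hypothetical configuration; the tool's real factors and thresholds are not given here.
SENSITIVE_SUFFIXES = (".sh", ".yml", "package.json")
TRUSTED_AUTHORS = {"alice"}
OUTLIER_LINES = 1000


def triggered_rules(commit: Commit) -> list:
    """Return the names of the illustrative rules that fire for a commit."""
    rules = []
    if any(path.endswith(SENSITIVE_SUFFIXES) for path in commit.files):
        rules.append("sensitive_file")
    if commit.lines_changed > OUTLIER_LINES:
        rules.append("outlier_change")
    if commit.author not in TRUSTED_AUTHORS:
        rules.append("untrusted_author")
    return rules


def is_anomalous(commit: Commit, min_rules: int = 2) -> bool:
    """Flag a commit when at least `min_rules` rules fire (assumed policy)."""
    return len(triggered_rules(commit)) >= min_rules


suspicious = Commit(author="mallory", files=["build.sh"], lines_changed=5)
routine = Commit(author="alice", files=["README.md"], lines_changed=10)
print(triggered_rules(suspicious), is_anomalous(suspicious))
print(triggered_rules(routine), is_anomalous(routine))
```

Requiring several rules to fire together is one simple way to consider factors jointly and keep the share of flagged commits low.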

AR · Apr 30
Autoformalizing Memory Specifications with Agents

Jan Ole Ernst, Dmitri Michelangelo Saberi, Derek Christ et al.

The primary goal of Design Verification (DV) is to ensure that a proposed chip design implementation (either in code, or physical form) exactly matches its specification and is free of functional errors in order to avoid costly re-designs. Achieving this often demands extensive manual interpretation, translating the specification document into a formal, testable representation. While AI has made progress in DV, current approaches typically focus on narrow, isolated tasks rather than full end-to-end specification compliance of modern chip designs, failing to capture the complexity of real-world verification. Our method automatically formalizes natural language memory chip specifications, for industry relevant Dynamic Random Access Memory (DRAM) standards, into a formal representation called DRAMPyML that can be used for downstream DV tasks like the generation of SystemVerilog assertions, stimulus, and functional coverage. We also release our benchmarking dataset, DRAMBench, which can be used to evaluate the evolution of model capabilities (and new approaches) at hardware autoformalization.

SE · Oct 15, 2024
Beyond the Comfort Zone: Emerging Solutions to Overcome Challenges in Integrating LLMs into Software Products

Nadia Nahar, Christian Kästner, Jenna Butler et al.

Large Language Models (LLMs) are increasingly embedded into software products across diverse industries, enhancing user experiences, but at the same time introducing numerous challenges for developers. Unique characteristics of LLMs force developers, who are accustomed to traditional software development and evaluation, out of their comfort zones as the LLM components shatter standard assumptions about software systems. This study explores the emerging solutions that software developers are adopting to navigate the encountered challenges. Leveraging mixed-method research, including 26 interviews and a survey with 332 responses, the study identifies 19 emerging solutions regarding quality assurance that practitioners across several product teams at Microsoft are exploring. The findings provide valuable insights that can guide the development and evaluation of LLM-based products more broadly in the face of these challenges.

SE · Feb 21, 2025
Time Warp: The Gap Between Developers' Ideal vs Actual Workweeks in an AI-Driven Era

Sukrit Kumar, Drishti Goel, Thomas Zimmermann et al.

Software developers balance a variety of different tasks in a workweek, yet the allocation of time often differs from what they consider ideal. Identifying and addressing these deviations is crucial for organizations aiming to enhance the productivity and well-being of the developers. In this paper, we present the findings from a survey of 484 software developers at Microsoft, which aims to identify the key differences between how developers would like to allocate their time during an ideal workweek versus their actual workweek. Our analysis reveals significant deviations between a developer's ideal workweek and their actual workweek, with a clear correlation: as the gap between these two workweeks widens, we observe a decline in both productivity and satisfaction. By examining these deviations in specific activities, we assess their direct impact on the developers' satisfaction and productivity. Additionally, given the growing adoption of AI tools in software engineering, both in the industry and academia, we identify specific tasks and areas that could be strong candidates for automation. In this paper, we make three key contributions: 1) we quantify the impact of workweek deviations on developer productivity and satisfaction; 2) we identify individual tasks that disproportionately affect satisfaction and productivity; and 3) we provide actual data-driven insights to guide future AI automation efforts in software engineering, aligning them with the developers' requirements and ideal workflows for maximizing their productivity and satisfaction.

SE · Nov 8, 2021
How Developers and Managers Define and Trade Productivity for Quality

Margaret-Anne Storey, Brian Houck, Thomas Zimmermann

In this paper, we present the findings from a survey study to investigate how developers and managers define and trade-off developer productivity and software quality (two related lenses into software development). We found that developers and managers, as cohorts, are not well aligned in their views of what it means to be productive (developers think of productivity in terms of activity, while more managers think of productivity in terms of performance). We also found that developers are not accurate at predicting their managers' views of productivity. In terms of quality, we found that individual developers and managers have quite varied views of what quality means to them, but as cohorts they are closely aligned in their different views, with the majority in both groups defining quality in terms of robustness. Over half of the developers and managers reported that quality can be traded for higher productivity and why this trade-off can be justified, while one third consider quality as a necessary part of productivity that cannot be traded. We also present a new descriptive framework for quality, TRUCE, that we synthesize from the survey responses. We call for more discussion between developers and managers about what they each consider as important software quality attributes, and to have open debate about how software quality relates to developer productivity and what trade-offs should or should not be made.

SE · Oct 15, 2021
Nalanda: A Socio-Technical Graph for Building Software Analytics Tools at Enterprise Scale

Chandra Maddila, Suhas Shanbhogue, Apoorva Agrawal et al.

Software development is information-dense knowledge work that requires collaboration with other developers and awareness of artifacts such as work items, pull requests, and files. With the speed of development increasing, information overload is a challenge for people developing and maintaining these systems. Finding information and people is difficult for software engineers, especially when they work in large software systems or have just recently joined a project. In this paper, we build a large-scale data platform named the Nalanda platform, which contains two subsystems: (1) a large-scale socio-technical graph system, named the Nalanda graph system, and (2) a large-scale recommendation system, named the Nalanda index system, that aims at satisfying the information needs of software developers. The Nalanda graph is an enterprise-scale graph with data from 6,500 repositories, with 37,410,706 nodes and 128,745,590 edges. On top of the Nalanda graph system, we built software analytics applications including a newsfeed named MyNalanda, and based on organic growth alone, it has Daily Active Users (DAU) of 290 and Monthly Active Users (MAU) of 590. A preliminary user study shows that 74% of developers and engineering managers surveyed are favorable toward continued use of the platform for information discovery. The Nalanda index system comprises two indices: an artifact index and an expert index. It uses the socio-technical graph (Nalanda graph system) to rank the results and provide better recommendations to software developers. A large-scale quantitative evaluation shows that the Nalanda index system provides recommendations with an accuracy of 78% for the top three recommendations.

SE · Sep 13, 2021
Developers Who Vlog: Dismantling Stereotypes through Community and Identity

Souti Chattopadhyay, Denae Ford, Thomas Zimmermann

Developers are more than "nerds behind computers all day"; they lead a normal life, and not all take the traditional path to learn programming. However, the public still sees software development as a profession for "math wizards". To learn more about this special type of knowledge worker from their first-person perspective, we conducted three studies to learn how developers describe a day in their life through vlogs on YouTube and how these vlogs were received by the broader community. We first interviewed 16 developers who vlogged to identify their motivations for creating this content and their intention behind what they chose to portray. Second, we analyzed 130 vlogs (video blogs) to understand the range of the content conveyed through videos. Third, we analyzed 1176 comments from the 130 vlogs to understand the impact the vlogs have on the audience. We found that developers were motivated to promote and build a diverse community, by sharing different aspects of life that define their identity, and by creating awareness about learning and career opportunities in computing. They used vlogs to share a variety of ways in which software developers work and live -- showcasing often unseen experiences, including intimate moments from their personal life. From our comment analysis, we found that the vlogs were valuable to the audience to find information and seek advice. Commenters sought opportunities to connect with others over shared triumphs and trials they faced that were also shown in the vlogs. As a central theme, we found that developers use vlogs to challenge the misconceptions and stereotypes around their identity, work-life, and well-being. These social stigmas are obstacles to an inclusive and accepting community and can deter people from choosing software development as a career. We also discuss the implications of using vlogs to support developers, researchers, and beyond.

SE · Aug 12, 2021
Automating the Removal of Obsolete TODO Comments

Zhipeng Gao, Xin Xia, David Lo et al.

TODO comments are very widely used by software developers to describe their pending tasks during software development. However, after performing the task developers sometimes neglect or simply forget to remove the TODO comment, resulting in obsolete TODO comments. These obsolete TODO comments can confuse development teams and may cause the introduction of bugs in the future, decreasing the software's quality and maintainability. In this work, we propose a novel model, named TDCleaner (TODO comment Cleaner), to identify obsolete TODO comments in software projects. TDCleaner can assist developers in just-in-time checking of TODO comments status and avoid leaving obsolete TODO comments. Our approach has two main stages: offline learning and online prediction. During offline learning, we first automatically establish <code_change, todo_comment, commit_msg> training samples and leverage three neural encoders to capture the semantic features of TODO comment, code change and commit message respectively. TDCleaner then automatically learns the correlations and interactions between different encoders to estimate the final status of the TODO comment. For online prediction, we check a TODO comment's status by leveraging the offline trained model to judge the TODO comment's likelihood of being obsolete. We built our dataset by collecting TODO comments from the top-10,000 Python and Java GitHub repositories and evaluated TDCleaner on them. Extensive experimental results show the promising performance of our model over a set of benchmarks. We also performed an in-the-wild evaluation with real-world software projects: we reported 18 obsolete TODO comments identified by TDCleaner to GitHub developers, and 9 of them have already been confirmed and removed by the developers, demonstrating the practical usage of our approach.

SE · Jul 14, 2021
Reel Life vs. Real Life: How Software Developers Share Their Daily Life through Vlogs

Souti Chattopadhyay, Thomas Zimmermann, Denae Ford

Software developers are turning to vlogs (video blogs) to share what a day is like to walk in their shoes. Through these vlogs developers share a rich perspective of their technical work as well as their personal lives. However, does the type of activities portrayed in vlogs differ from activities developers in the industry perform? Would developers at a software company prefer to show activities to different extents if they were asked to share about their day through vlogs? To answer these questions, we analyzed 130 vlogs by software developers on YouTube and conducted a survey with 335 software developers at a large software company. We found that although vlogs present traditional development activities such as coding and code peripheral activities (11%), they also prominently feature wellness and lifestyle related activities (47.3%) that have not been reflected in previous software engineering literature. We also found that developers at the software company were inclined to share more non-coding tasks (e.g., personal projects, time spent with family and friends, and health) when asked to create a mock-up vlog to promote diversity. These findings demonstrate a shift in our understanding of how software developers are spending their time and find valuable to share publicly. We discuss how vlogs provide a more complete perspective of software development work and serve as a valuable source of data for empirical research.

SE · Mar 16, 2021
Accessibility in Software Practice: A Practitioner's Perspective

Tingting Bi, Xin Xia, David Lo et al.

Being able to access software in daily life is vital for everyone, and thus accessibility is a fundamental challenge for software development. However, given the number of accessibility issues reported by many users, e.g., in app reviews, it is not clear if accessibility is widely integrated into current software projects and how software projects address accessibility issues. In this paper, we report a study of the critical challenges and benefits of incorporating accessibility into software development and design. We applied a mixed qualitative and quantitative approach for gathering data from 15 interviews and 365 survey respondents from 26 countries across five continents to understand how practitioners perceive accessibility development and design in practice. We got 44 statements grouped into eight topics on accessibility from practitioners' viewpoints and different software development stages. Our statistical analysis reveals substantial gaps between groups, e.g., practitioners with Direct vs. Indirect accessibility-relevant work experience, when they reviewed the summarized statements. These gaps might hinder the quality of accessibility development and design, and we use our findings to establish a set of guidelines to help practitioners be aware of accessibility challenges and benefit factors. We also propose some remedies to resolve the gaps and to highlight key future research directions.

SE · Jan 14, 2021
"How Was Your Weekend?" Software Development Teams Working From Home During COVID-19

Courtney Miller, Paige Rodeghero, Margaret-Anne Storey et al.

The mass shift to working at home during the COVID-19 pandemic radically changed the way many software development teams collaborate and communicate. To investigate how team culture and team productivity may also have been affected, we conducted two surveys at a large software company. The first, an exploratory survey during the early months of the pandemic with 2,265 developer responses, revealed that many developers faced challenges reaching milestones and that their team productivity had changed. We also found through qualitative analysis that important team culture factors such as communication and social connection had been affected. For example, the simple phrase "How was your weekend?" had become a subtle way to show peer support. In our second survey, we conducted a quantitative analysis of the team cultural factors that emerged from our first survey to understand the prevalence of the reported changes. From 608 developer responses, we found that 74% of these respondents missed social interactions with colleagues and 51% reported a decrease in their communication ease with colleagues. We used data from the second survey to build a regression model to identify important team culture factors for modeling team productivity. We found that the ability to brainstorm with colleagues, difficulty communicating with colleagues, and satisfaction with interactions from social activities are important factors that are associated with how developers report their software development team's productivity. Our findings inform how managers and leaders in large software companies can support sustained team productivity during times of crisis and beyond.

SE · Dec 14, 2020
Mind the Gap: On the Relationship Between Automatically Measured and Self-Reported Productivity

Moritz Beller, Vince Orgovan, Spencer Buja et al.

To improve software developers' productivity has been the holy grail of software engineering research. But before we can claim to have improved it, we must first be able to measure productivity. This is far from trivial. In fact, two separate research lines on software engineers' productivity have co-existed almost in complete isolation for a long time: automated product and process measures on the one hand and self-reported or perceived productivity on the other hand. In this article, we bridge the gap between the two with an empirical study of 81 software developers at Microsoft.

SE · Nov 16, 2020
Please Turn Your Cameras On: Remote Onboarding of Software Developers during a Pandemic

Paige Rodeghero, Thomas Zimmermann, Brian Houck et al.

The COVID-19 pandemic has impacted the way that software development teams onboard new hires. Previously, most software developers worked in physical offices and new hires onboarded to their teams in the physical office, following a standard onboarding process. However, when companies transitioned employees to work from home due to the pandemic, there was little to no time to develop new onboarding procedures. In this paper, we present a survey of 267 new hires at Microsoft that onboarded to software development teams during the pandemic. We explored their remote onboarding process, including the challenges that the new hires encountered and their social connectedness with their teams. We found that most developers onboarded remotely and never had an opportunity to meet their teammates in person. This leads to one of the biggest challenges faced by these new hires, building a strong social connection with their team. We use these results to provide recommendations for onboarding remote hires.

SE · Nov 10, 2020
How do Practitioners Perceive the Relevance of Requirements Engineering Research?

Xavier Franch, Daniel Mendez, Andreas Vogelsang et al.

The relevance of Requirements Engineering (RE) research to practitioners is vital for a long-term dissemination of research results to everyday practice. Some authors have speculated about a mismatch between research and practice in the RE discipline. However, there is not much evidence to support or refute this perception. This paper presents the results of a study aimed at gathering evidence from practitioners about their perception of the relevance of RE research and at understanding the factors that influence that perception. We conducted a questionnaire-based survey of industry practitioners with expertise in RE. The participants rated the perceived relevance of 435 scientific papers presented at five top RE-related conferences. The 153 participants provided a total of 2,164 ratings. The practitioners rated RE research as essential or worthwhile in a majority of cases. However, the percentage of non-positive ratings is still higher than we would like. Among the factors that affect the perception of relevance are the research's links to industry, the research method used, and respondents' roles. The reasons for positive perceptions were primarily related to the relevance of the problem and the soundness of the solution, while the causes for negative perceptions were more varied. The respondents also provided suggestions for future research, including topics researchers have studied for decades, like elicitation or requirement quality criteria.

SEAug 25, 2020
A Tale of Two Cities: Software Developers Working from Home During the COVID-19 Pandemic

Denae Ford, Margaret-Anne Storey, Thomas Zimmermann et al.

The COVID-19 pandemic has shaken the world to its core and has provoked an overnight exodus of developers who normally worked in an office setting to working from home. The magnitude of this shift and the factors that have accompanied this new unplanned work setting go beyond what the software engineering community has previously understood to be remote work. To find out how developers and their productivity were affected, we distributed two surveys weeks apart (with a combined total of 3,634 responses that answered all required questions) to understand the presence and prevalence of the benefits, challenges, and opportunities to improve this special circumstance of remote work. From our thematic qualitative analysis and statistical quantitative analysis, we find that there is a dichotomy of developer experiences influenced by many different factors (that for some are a benefit, while for others a challenge). For example, being close to family members was a benefit for some, but for others, having family members share their working space and interrupt their focus was a challenge. Our surveys led to powerful narratives from respondents and revealed the scale at which these experiences exist to provide insights as to how the future of (pandemic) remote work can evolve.

SEJul 10, 2020
Neural Knowledge Extraction From Cloud Service Incidents

Manish Shetty, Chetan Bansal, Sumit Kumar et al.

In the last decade, two paradigm shifts have reshaped the software industry: the move from boxed products to services and the widespread adoption of cloud computing. This has had a huge impact on the software development life cycle and the DevOps processes. In particular, incident management has become critical for developing and operating large-scale services. Incidents are created to ensure timely communication of service issues and their resolution. Prior work on incident management has been heavily focused on the challenges with incident triaging and de-duplication. In this work, we address the fundamental problem of structured knowledge extraction from service incidents. We have built SoftNER, a framework for unsupervised knowledge extraction from service incidents. We frame the knowledge extraction problem as a named-entity recognition (NER) task for extracting factual information. SoftNER leverages structural patterns like key-value pairs and tables for bootstrapping the training data. Further, we build a novel multi-task learning based BiLSTM-CRF model which leverages not just the semantic context but also the data types for named-entity extraction. We have deployed SoftNER at Microsoft, a major cloud service provider, and have evaluated it on more than 2 months of cloud incidents. We show that the unsupervised machine learning based approach has a high precision of 0.96. Our multi-task learning based deep learning model also outperforms the state-of-the-art NER models. Lastly, using the knowledge extracted by SoftNER, we are able to build significantly more accurate models for important downstream tasks like incident triaging.
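
As a rough illustration of the bootstrapping idea in this abstract, the sketch below pulls "Key: value" lines out of an incident description and reuses each value as a weak entity label wherever it reappears in the text. It is not SoftNER's implementation: the regular expression, the function names `extract_key_value_pairs` and `weak_labels`, and the sample incident text are all assumptions made for this example.

```python
import re

# Assumed line shape: "Key: value" or "Key = value" (hypothetical pattern,
# not taken from the paper).
KEY_VALUE = re.compile(r"^\s*([A-Za-z][A-Za-z0-9 _]*?)\s*[:=]\s*(\S.*?)\s*$")


def extract_key_value_pairs(text):
    """Return (key, value) tuples for lines shaped like 'Key: value'."""
    pairs = []
    for line in text.splitlines():
        match = KEY_VALUE.match(line)
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs


def weak_labels(text):
    """Label every occurrence of an extracted value with its key.

    Returns (start, end, entity_type) character spans sorted by position.
    """
    labels = []
    for key, value in extract_key_value_pairs(text):
        for hit in re.finditer(re.escape(value), text):
            labels.append((hit.start(), hit.end(), key))
    return sorted(labels)


# Made-up incident text for illustration only.
incident = (
    "Subscription Id: abc-123\n"
    "VM Name = web01\n"
    "Customer reports web01 is unreachable."
)

for start, end, entity_type in weak_labels(incident):
    print(entity_type, "->", incident[start:end])
```

Spans produced this way could serve as noisy training labels for a sequence tagger; the model itself is outside the scope of this sketch.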

SEMay 30, 2020
An Empirical Study of Software Exceptions in the Field using Search Logs

Foyzul Hassan, Chetan Bansal, Nachiappan Nagappan et al.

Software engineers spend a substantial amount of time using Web search to accomplish software engineering tasks. Such search tasks include finding code snippets, API documentation, seeking help with debugging, etc. While debugging a bug or crash, one of the common practices of software engineers is to search for information about the associated error or exception traces on the internet. In this paper, we analyze query logs from a leading commercial general-purpose search engine (GPSE) such as Google, Yahoo! or Bing to carry out a large-scale study of software exceptions. To the best of our knowledge, this is the first large-scale study to analyze how Web search is used to find information about exceptions. We analyzed about 1 million exception-related search queries from a random sample of 5 billion web search queries. To extract exceptions from unstructured query text, we built a novel and high-performance machine learning model with an F1-score of 0.82. Using the machine learning model, we extracted exceptions from raw queries and performed popularity, effort, success, query characteristic and web domain analysis. We also performed programming language-specific analysis to give a better view of the exception search behavior. These techniques can help improve existing methods, documentation and tools for exception analysis and prediction. Further, similar techniques can be applied for APIs, frameworks, etc.

SEDec 19, 2019
Analyzing Web Search Behavior for Software Engineering Tasks

Nikitha Rao, Chetan Bansal, Thomas Zimmermann et al.

Web search plays an integral role in software engineering (SE) to help with various tasks such as finding documentation, debugging, installation, etc. In this work, we present the first large-scale analysis of web search behavior for SE tasks using the search query logs from Bing, a commercial web search engine. First, we use distant supervision techniques to build a machine learning classifier to extract the SE search queries with an F1 score of 93%. We then perform an analysis on one million search sessions to understand how software engineering-related queries and sessions differ from other queries and sessions. Subsequently, we propose a taxonomy of intents to identify the various contexts in which web search is used in software engineering. Lastly, we analyze millions of SE queries to understand the distribution, search metrics and trends across these SE search intents. Our analysis shows that SE-related queries form a significant portion of the overall web search traffic. Additionally, we found that there are six major intent categories for which web search is used in software engineering. The techniques and insights can not only help improve existing tools but can also inspire the development of new tools that aid in finding information for SE-related tasks.

SEJan 17, 2019
Mining Treatment-Outcome Constructs from Sequential Software Engineering Data

Maleknaz Nayebi, Guenther Ruhe, Thomas Zimmermann

Many investigations in empirical software engineering look at sequences of data resulting from development or management processes. In this paper, we propose an analytical approach called the Gandhi-Washington Method (GWM) to investigate the impact of recurring events in software projects. GWM takes an encoding of events and activities provided by a software analyst as input. It uses regular expressions to automatically condense and summarize information and infer treatments. Relating the treatments to the outcome through statistical tests, treatment-outcome constructs are automatically mined from the data. The output of GWM is a set of treatment-outcome constructs. Each treatment in the set of mined constructs is significantly different from the other treatments considering the impact on the outcome and/or is structurally different from other treatments considering the sequence of events. We describe GWM and classes of problems to which GWM can be applied. We demonstrate the applicability of this method for empirical studies on sequences of file editing, code ownership, and release cycle time.
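
To make the regular-expression step in this abstract concrete, the sketch below condenses an encoded event sequence and groups outcomes by the condensed pattern. It is only an illustration under stated assumptions, not the GWM algorithm from the paper: the one-letter event encoding (E = edit, C = commit, R = review), the function names `condense` and `group_outcomes`, and the sample data are all hypothetical, and the statistical comparison of treatments is left out.

```python
import re
from collections import defaultdict


def condense(sequence):
    """Collapse each run of a repeated event letter into 'X+'.

    For example, 'EE' and 'EEEE' both become 'E+', so sequences that
    differ only in repetition count map to the same candidate treatment.
    """
    return re.sub(r"(.)\1+", r"\1+", sequence)


def group_outcomes(observations):
    """Group outcome values by the condensed form of their event sequence."""
    groups = defaultdict(list)
    for sequence, outcome in observations:
        groups[condense(sequence)].append(outcome)
    return dict(groups)


# Made-up (sequence, outcome) pairs; the outcome could be, say, a cycle time.
observations = [("EEC", 3.0), ("EC", 5.0), ("EEEEC", 4.0)]

print(condense("EEECRR"))
print(group_outcomes(observations))
```

In a fuller pipeline, the outcome lists of the resulting groups would then be compared with a statistical test to decide which candidate treatments differ significantly.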