CLSEApr 15, 2022

Is Surprisal in Issue Trackers Actionable?

CambridgeMicrosoft
arXiv:2204.07363v12 citationsh-index: 41
Originality Incremental advance
AI Analysis

This addresses the need for early detection of problematic events in software development to save effort and cost, though it is incremental over prior code metrics-based methods.

The study tackled the problem of detecting unusual events in software repositories by using surprisal from language models, finding that surprising issues and pull requests correlate with higher importance and resolution difficulty, with correlations up to 0.6 in some cases.

Background. From information theory, surprisal is a measurement of how unexpected an event is. Statistical language models provide a probabilistic approximation of natural languages, and because surprisal is constructed with the probability of an event occuring, it is therefore possible to determine the surprisal associated with English sentences. The issues and pull requests of software repository issue trackers give insight into the development process and likely contain the surprising events of this process. Objective. Prior works have identified that unusual events in software repositories are of interest to developers, and use simple code metrics-based methods for detecting them. In this study we will propose a new method for unusual event detection in software repositories using surprisal. With the ability to find surprising issues and pull requests, we intend to further analyse them to determine if they actually hold importance in a repository, or if they pose a significant challenge to address. If it is possible to find bad surprises early, or before they cause additional troubles, it is plausible that effort, cost and time will be saved as a result. Method. After extracting the issues and pull requests from 5000 of the most popular software repositories on GitHub, we will train a language model to represent these issues. We will measure their perceived importance in the repository, measure their resolution difficulty using several analogues, measure the surprisal of each, and finally generate inferential statistics to describe any correlations.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes