SE | LG | Feb 6, 2025

Combining Language and App UI Analysis for the Automated Assessment of Bug Reproduction Steps

arXiv:2502.04251v1 | 8 citations | h-index: 56 | ICPC
Originality: Incremental advance
AI Analysis

The paper addresses inefficient bug resolution by improving automated quality assessment of steps to reproduce (S2Rs). The advance is incremental: it builds on prior work, and its novel element is a hybrid approach.

The paper tackles automated assessment of the steps to reproduce (S2Rs) in bug reports, which are often unclear or incomplete. It proposes AstroBR, a technique that combines LLM-based language understanding with app UI analysis. Compared to a state-of-the-art baseline, AstroBR improves F1 score by 25.2% for S2R annotation and by 71.4% for suggesting missing steps.

Bug reports are essential for developers to confirm software problems, investigate their causes, and validate fixes. Unfortunately, reports often miss important information or are written unclearly, which can cause delays, increased issue resolution effort, or even the inability to solve issues. One of the most commonly problematic components of reports is the steps to reproduce the bug(s) (S2Rs), which are essential to replicate the described program failures and reason about fixes. Given the proclivity for deficiencies in reported S2Rs, prior work has proposed techniques that assist reporters in writing or assessing the quality of S2Rs. However, automated understanding of S2Rs is challenging and requires linking nuanced natural language phrases with specific, semantically related program information. Prior techniques often struggle to form such language-to-program connections due to language variability and the limitations of information gleaned from program analyses. To more effectively tackle the problem of S2R quality annotation, we propose a new technique called AstroBR, which leverages the language understanding capabilities of LLMs to identify and extract the S2Rs from bug reports and map them to GUI interactions in a program state model derived via dynamic analysis. We compared AstroBR to a related state-of-the-art approach and found that AstroBR annotates S2Rs 25.2% better (in terms of F1 score) than the baseline. Additionally, AstroBR suggests more accurate missing S2Rs than the baseline (by 71.4% in terms of F1 score).
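The idea of mapping S2Rs to GUI interactions in a program state model can be illustrated with a toy sketch. This is not AstroBR's implementation: the state model, the `STATE_MODEL`, `match_step`, `path_to_match`, and `assess` names, and the word-overlap matcher (standing in for LLM-based language understanding) are all assumptions made for illustration. The sketch walks a small hand-written screen graph, matches each reported step to an interaction, and reports any interactions that had to be traversed first as candidate missing steps.

```python
from collections import deque

# Hypothetical toy state model: screen -> [(GUI interaction, next screen)].
# A real model would come from dynamic analysis of the app.
STATE_MODEL = {
    "home": [("tap settings button", "settings")],
    "settings": [("tap dark mode toggle", "settings_dark"),
                 ("tap back button", "home")],
    "settings_dark": [("tap back button", "home_dark")],
    "home_dark": [],
}

# Generic words ignored by the naive matcher (an assumption of this sketch).
STOP = {"tap", "the", "on", "a"}


def match_step(step, state):
    """Return the (interaction, next_state) from `state` that shares the most
    non-generic words with `step`, or None if nothing overlaps."""
    words = set(step.lower().split()) - STOP
    best, best_score = None, 0
    for interaction, nxt in STATE_MODEL[state]:
        score = len(words & (set(interaction.split()) - STOP))
        if score > best_score:
            best, best_score = (interaction, nxt), score
    return best


def path_to_match(step, start):
    """Breadth-first search for the nearest state where `step` matches.
    Returns (skipped_interactions, matched_interaction, next_state) or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, skipped = queue.popleft()
        hit = match_step(step, state)
        if hit is not None:
            return skipped, hit[0], hit[1]
        for interaction, nxt in STATE_MODEL[state]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, skipped + [interaction]))
    return None


def assess(steps, start="home"):
    """For each reported step, return (step, matched interaction or None,
    interactions that appear to be missing before it)."""
    state, report = start, []
    for step in steps:
        found = path_to_match(step, state)
        if found is None:
            report.append((step, None, []))  # step could not be mapped
            continue
        skipped, interaction, state = found
        report.append((step, interaction, skipped))
    return report


if __name__ == "__main__":
    for row in assess(["Enable dark mode", "Go back"]):
        print(row)
```

In this sketch, a report that says only "Enable dark mode" cannot be matched on the home screen, so the search continues into the settings screen and flags "tap settings button" as a likely missing step. Word overlap is a deliberately crude placeholder; the language variability the abstract describes is exactly what such a matcher would handle poorly.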

