52.1 SE Apr 11 Code
Fine-grained Multi-Document Extraction and Generation of Code Change Rationale
Mehedi Sun, Antu Saha, Nadeeshan De Silva et al.
Understanding the reasons behind past code changes is critical for many software engineering tasks, including refactoring and reviewing code, diagnosing bugs, and implementing new features. Unfortunately, locating and reconstructing this rationale can be difficult for developers because the information is often fragmented, inconsistently documented, and scattered across different artifacts such as commit messages, issue reports, and PRs. In this paper, we address this challenge in two steps. First, we conduct an empirical study of 63 commits from five open-source Java projects to analyze how rationale components (e.g., a change's goal, need, and alternative) are distributed across artifacts. We find that the rationale is highly fragmented: commit messages and pull requests primarily capture goals, while needs and alternatives are more often found in issues and PRs. Other components are scarce but found in artifacts other than commit messages. No single artifact type captures all components, underscoring the need for cross-document reasoning and synthesis. Second, we introduce ARGUS, an LLM-based approach that identifies sentences expressing goal, need, and alternative across a commit's artifacts and creates concise rationale summaries to support code comprehension and maintenance tasks. We evaluated ARGUS on the 63 commits and compared its performance against baseline variants. The best-performing version achieved 51.4% precision and 93.2% recall for rationale identification, while producing rationale summaries rated as accurate. A user study with 12 Java developers further showed that these summaries were perceived as useful and helpful for tasks such as code review, documentation, and debugging. Our results highlight the need for multi-document reasoning in capturing rationale and demonstrate the potential of ARGUS to help developers understand and maintain software systems.
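The abstract reports precision and recall for rationale identification but no combined score. As a rough illustration only (the resulting number is derived here and is not a figure stated in the abstract), the sketch below computes the standard F1 harmonic mean from the two reported values. The helper name `f1_score` is hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract for the best ARGUS version.
argus_f1 = f1_score(0.514, 0.932)
print(f"{argus_f1:.3f}")  # prints 0.663
```

Because F1 is a harmonic mean, the lower precision pulls the combined score well below the 93.2% recall.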
50.3 SE Mar 23
Evaluating Language Model Applications for Identifying Solution-Related Content in Issue Report Discussions
Antu Saha, Mehedi Sun, Oscar Chaparro
During issue resolution, software developers rely on issue reports to discuss solutions for defects, feature requests, and other changes. These discussions contain proposed solutions--from design changes to code implementations--as well as their evaluations. Locating solution-related content is essential for investigating reopened issues, addressing regressions, reusing solutions, and understanding code change rationale. Manually reading long discussions to identify such content can be difficult and time-consuming. This paper automates solution identification using language models as supervised classifiers. We investigate three applications--embeddings, prompting, and fine-tuning--across three classifier types: traditional ML models (MLMs), pre-trained language models (PLMs), and large language models (LLMs). Using 356 Mozilla Firefox issues, we created a dataset to train and evaluate six MLMs, four PLMs, and two LLMs across 68 configurations. Results show that MLMs using LLM embeddings outperform those using TF-IDF features, prompting underperforms, and fine-tuned LLMs achieve the highest performance, with LLAMAft reaching an F1 score of 0.716. Ensembles of the best models further improve results (0.737 F1). Misclassifications often arise from misleading clues or missing context, highlighting the need for context-aware classifiers. Models trained on Mozilla data transfer to other projects, and adding a small amount of project-specific data further improves results. This work supports software maintenance, issue understanding, and solution reuse.
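The abstract does not say how the ensembles combine the outputs of the best models. As one common possibility, and purely as an assumption rather than the paper's confirmed method, the sketch below uses simple majority voting over binary predictions. The function `majority_vote` and the three model outputs are hypothetical placeholders.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine per-model label lists into one list by majority vote.

    predictions[m][i] is model m's label for item i.
    """
    if not predictions:
        return []
    n_items = len(predictions[0])
    if any(len(p) != n_items for p in predictions):
        raise ValueError("all models must label the same number of items")
    combined = []
    for i in range(n_items):
        # Count the votes for item i and keep the most frequent label.
        votes = Counter(model[i] for model in predictions)
        combined.append(votes.most_common(1)[0][0])
    return combined


# Made-up outputs: 1 = solution-related comment, 0 = other (assumed encoding).
model_a = [1, 0, 1, 0]
model_b = [1, 1, 0, 0]
model_c = [0, 1, 1, 0]
ensemble = majority_vote([model_a, model_b, model_c])
print(ensemble)  # prints [1, 1, 1, 0]
```

Using an odd number of models with binary labels avoids ties. Other combination schemes, such as averaging predicted probabilities, are equally plausible readings of the abstract.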
SE Feb 6, 2025
Combining Language and App UI Analysis for the Automated Assessment of Bug Reproduction Steps
Junayed Mahmud, Antu Saha, Oscar Chaparro et al.
Bug reports are essential for developers to confirm software problems, investigate their causes, and validate fixes. Unfortunately, reports often miss important information or are written unclearly, which can cause delays, increased issue resolution effort, or even the inability to solve issues. One of the most commonly problematic report components is the steps to reproduce the bug(s) (S2Rs), which are essential to replicate the described program failures and reason about fixes. Given the proclivity for deficiencies in reported S2Rs, prior work has proposed techniques that assist reporters in writing or assessing the quality of S2Rs. However, automated understanding of S2Rs is challenging and requires linking nuanced natural language phrases with specific, semantically related program information. Prior techniques often struggle to form such language-to-program connections because of language variability and the limited information gleaned from program analyses. To more effectively tackle the problem of S2R quality annotation, we propose a new technique called AstroBR, which leverages the language understanding capabilities of LLMs to identify and extract the S2Rs from bug reports and map them to GUI interactions in a program state model derived via dynamic analysis. We compared AstroBR to a related state-of-the-art approach and found that AstroBR annotates S2Rs 25.2% better (in terms of F1 score) than the baseline. Additionally, AstroBR suggests more accurate missing S2Rs than the baseline (by 71.4% in terms of F1 score).
36.6 SE Apr 1
Automated Generation of High-Quality Bug Reports for Android Applications
Antu Saha, Atish Kumar Dipongkor, Sam Bennett et al.
Most defects in mobile applications are visually observable on the device screen. To track these defects, users, testers, and developers must manually submit bug reports, especially in the absence of crashes. However, these reports are frequently ambiguous or inaccurate, often omitting essential components such as the Observed Behavior (OB), Expected Behavior (EB), or Steps to Reproduce (S2Rs). Low-quality reports hinder developers' ability to understand and reproduce defects, delaying resolution and leading to incorrect or unresolvable fixes. In this paper, we posit that providing specific app-related information (e.g., GUI interactions or specific screens where bugs appear) to LLMs as key points of context can assist in automatically generating clear, detailed, and accurate OB, EB, and S2Rs. We built and evaluated a novel approach, BugScribe, that generates bug reports in this way. To support the evaluation, we introduce a unified quality framework that defines correctness and completeness dimensions for OB, EB, and S2Rs. Using 48 bug reports from 26 Android apps, we show that BugScribe produces higher-quality and more accurate components than the original reports and outperforms recent LLM-based baselines. We envision that BugScribe can serve as a practical assistant for testers and developers by enhancing incomplete bug reports with reliable and accurate OB, EB, and S2Rs, thereby streamlining bug resolution and improving mobile app quality.