20.1 SE Jun 5 Code
From Custom Logic to APIs: Understanding and Recommending API Replacement Refactorings
Bridget Nyirongo, Yanjie Jiang, Yuxia Zhang et al.
Software refactoring is essential for maintaining code quality. However, API replacement refactoring, which replaces custom logic with API calls, remains underexplored. Existing refactoring tools provide limited support for detecting such opportunities because they rely on predefined templates and have difficulty capturing complex, multi-statement semantic equivalents. To address this limitation, we conduct the first empirical study of API replacement refactorings by mining 166,299 commits across six open-source Java projects and manually analyzing a curated subset of 1,800 commits, from which we identify 366 validated instances to characterize their scope, categories, and recurring patterns. Based on these insights, we propose AKIRA (Adaptive Knowledge Discovery and Retrieval), a hybrid framework that integrates pattern-deterministic heuristics with a refactoring-aware knowledge base to assess the practical feasibility of recommending API replacement refactorings. Our evaluation shows that AKIRA achieves 90% recall and 88% precision on a manually curated dataset. Furthermore, on the external RETIWA dataset, AKIRA significantly improves the state of the art by increasing recall from 21% to 81% and precision from 40% to 78%. These results demonstrate the effectiveness of combining static pattern matching with semantic reasoning to support the automation of recommending complex API replacement refactorings.
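As a rough illustration of what "replacing custom logic with API calls" means, the sketch below shows a hand-written multi-statement routine and an equivalent single library call. It is written in Python for brevity, whereas the study itself mines Java projects; the function names `join_names_custom` and `join_names_api` are hypothetical and are not taken from the paper or its dataset.

```python
from typing import List


def join_names_custom(names: List[str]) -> str:
    """Custom logic: build a comma-separated string by hand."""
    result = ""
    for index, name in enumerate(names):
        if index > 0:
            result += ", "  # separator goes before every element except the first
        result += name
    return result


def join_names_api(names: List[str]) -> str:
    """Same behavior expressed with the built-in str.join API."""
    return ", ".join(names)
```

The two functions are semantically equivalent, but the custom version spans several statements, which is the kind of case the abstract says template-based tools have difficulty detecting.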
23.9 SE May 12
Characterizing the Failure Modes of LLMs in Resolving Real-World GitHub Issues
Yanjie Jiang, Yian Huang, Guancheng Wang et al.
Large Language Models (LLMs) are increasingly deployed to resolve real-world GitHub issues. However, despite their potential, the specific failure modes of these models in complex repair tasks remain poorly understood. To characterize how LLM behavior diverges from human developer practices, this paper evaluates three state-of-the-art models, i.e., Claude 4.5 Sonnet, Gemini 3 Pro, and GPT-5, on the SWE-bench Verified dataset. We conduct a rigorous manual analysis of the symptoms and root causes underlying 243 failed attempts across 900 total trials. Our investigation first yields a unified failure taxonomy encompassing five distinct stages of the repair pipeline, within which we categorize typical failure symptoms and their prevalence. Secondly, our findings reveal that for all evaluated LLMs, strategy formulation and logic synthesis constitutes the most error-prone stage, followed by problem understanding, whereas localization exhibits the lowest failure rate. This suggests that LLMs may excel at fault localization, a task traditionally regarded as one of the most formidable challenges in automated program repair. Furthermore, we observe that robustness and operational costs (particularly in failure scenarios) vary significantly across different models. Finally, we uncover the root causes of these failures and propose actionable strategies to mitigate them. A particularly notable finding is that existing evaluation harnesses occasionally misjudge correct patches due to superficial discrepancies or hidden constraints. Collectively, our insights may provide promising directions for enhancing the effectiveness and reliability of LLM-based issue resolution.
SE Feb 27, 2021
Extracting Concise Bug-Fixing Patches from Human-Written Patches in Version Control Systems
Yanjie Jiang, Hui Liu, Nan Niu et al.
High-quality and large-scale repositories of real bugs and their concise patches collected from real-world applications are critical for research in the software engineering community. In such a repository, each real bug is explicitly associated with its fix. Therefore, on one side, the real bugs and their fixes may inspire novel approaches for finding, locating, and repairing software bugs; on the other side, the real bugs and their fixes are indispensable for rigorous and meaningful evaluation of approaches for software testing, fault localization, and program repair. To this end, a number of such repositories, e.g., Defects4J, have been proposed. However, such repositories are rather small because their construction involves expensive human intervention. Although bug-fixing code commits as well as associated test cases can be retrieved from version control systems automatically, existing approaches cannot yet automatically extract concise bug-fixing patches from bug-fixing commits because such commits often involve bug-irrelevant changes. In this paper, we propose an automatic approach, called BugBuilder, to extract complete and concise bug-fixing patches from human-written patches in version control systems. It excludes refactorings by detecting refactorings involved in bug-fixing commits and reapplying the detected refactorings to the faulty version. It then enumerates all subsets of the remaining changes and validates them on test cases. If none of the subsets has the potential to be a complete bug-fixing patch, the remaining part as a whole is taken as a complete and concise bug-fixing patch. Evaluation results on 809 real bug-fixing commits in Defects4J suggest that BugBuilder successfully generated complete and concise bug-fixing patches for forty percent of the bug-fixing commits, and its precision (99%) was even higher than that of human experts.
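The subset-enumeration step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not BugBuilder's actual implementation: the function `candidate_patches` and the `passes_tests` callback are hypothetical names, changes are modeled as opaque hashable items, and test validation is abstracted into a predicate. The abstract does not say how multiple passing subsets are ranked, so the sketch simply returns all of them.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Hashable, Iterable, List


def candidate_patches(
    changes: Iterable[Hashable],
    passes_tests: Callable[[FrozenSet[Hashable]], bool],
) -> List[FrozenSet[Hashable]]:
    """Enumerate proper, non-empty subsets of the remaining changes.

    `passes_tests` is a hypothetical stand-in for applying a subset to the
    faulty version and running the associated test cases.
    """
    remaining = list(changes)
    candidates = []
    # Smaller subsets first; the full set is handled by the fallback below.
    for size in range(1, len(remaining)):
        for subset in combinations(remaining, size):
            patch = frozenset(subset)
            if passes_tests(patch):
                candidates.append(patch)
    if not candidates:
        # No subset qualifies: take the remaining part as a whole.
        return [frozenset(remaining)]
    return candidates
```

Exhaustive enumeration grows exponentially with the number of changes, so a sketch like this is only practical when few changes remain after the refactorings have been excluded.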