CLAug 3, 2023Code
ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code GenerationXueying Du, Mingwei Liu, Kaixin Wang et al.
In this work, we make the first attempt to evaluate LLMs in a more challenging code generation scenario, i.e. class-level code generation. We first manually construct the first class-level code generation benchmark ClassEval of 100 class-level Python code generation tasks with approximately 500 person-hours. Based on it, we then perform the first study of 11 state-of-the-art LLMs on class-level code generation. Based on our results, we have the following main findings. First, we find that all existing LLMs show much worse performance on class-level code generation compared to on standalone method-level code generation benchmarks like HumanEval; and the method-level coding ability cannot equivalently reflect the class-level coding ability among LLMs. Second, we find that GPT-4 and GPT-3.5 still exhibit dominate superior than other LLMs on class-level code generation, and the second-tier models includes Instruct-Starcoder, Instruct-Codegen, and Wizardcoder with very similar performance. Third, we find that generating the entire class all at once (i.e. holistic generation strategy) is the best generation strategy only for GPT-4 and GPT-3.5, while method-by-method generation (i.e. incremental and compositional) is better strategies for the other models with limited ability of understanding long instructions and utilizing the middle information. Lastly, we find the limited model ability of generating method-dependent code and discuss the frequent error types in generated classes. Our benchmark is available at https://github.com/FudanSELab/ClassEval.
CLAug 2, 2023Code
Evaluating Instruction-Tuned Large Language Models on Code Comprehension and GenerationZhiqiang Yuan, Junwei Liu, Qiancheng Zi et al.
In this work, we evaluate 10 open-source instructed LLMs on four representative code comprehension and generation tasks. We have the following main findings. First, for the zero-shot setting, instructed LLMs are very competitive on code comprehension and generation tasks and sometimes even better than small SOTA models specifically fine-tuned on each downstream task. We also find that larger instructed LLMs are not always better on code-related tasks. Second, for the few-shot setting, we find that adding demonstration examples substantially helps instructed LLMs perform better on most code comprehension and generation tasks; however, the examples would sometimes induce unstable or even worse performance. Furthermore, we find widely-used BM25-based shot selection strategy significantly outperforms the basic random selection or fixed selection only on generation problems. Third, for the fine-tuning setting, we find that fine-tuning could further improve the model performance on downstream code comprehension and generation tasks compared to the zero-shot/one-shot performance. In addition, after being fine-tuned on the same downstream task dataset, instructed LLMs outperform both the small SOTA models and similar-scaled LLMs without instruction tuning. Based on our findings, we further present practical implications on model and usage recommendation, performance and cost trade-offs, and future direction.
SESep 4, 2024Code
Large Language Model-Based Agents for Software Engineering: A SurveyJunwei Liu, Kaixin Wang, Yixuan Chen et al.
The recent advance in Large Language Models (LLMs) has shaped a new paradigm of AI agents, i.e., LLM-based agents. Compared to standalone LLMs, LLM-based agents substantially extend the versatility and expertise of LLMs by enhancing LLMs with the capabilities of perceiving and utilizing external resources and tools. To date, LLM-based agents have been applied and shown remarkable effectiveness in Software Engineering (SE). The synergy between multiple agents and human interaction brings further promise in tackling complex real-world SE problems. In this work, we present a comprehensive and systematic survey on LLM-based agents for SE. We collect 124 papers and categorize them from two perspectives, i.e., the SE and agent perspectives. In addition, we discuss open challenges and future directions in this critical domain. The repository of this survey is at https://github.com/FudanSELab/Agent4SE-Paper-List.
SEFeb 3, 2023
KNOD: Domain Knowledge Distilled Tree Decoder for Automated Program RepairNan Jiang, Thibaud Lutellier, Yiling Lou et al.
Automated Program Repair (APR) improves software reliability by generating patches for a buggy program automatically. Recent APR techniques leverage deep learning (DL) to build models to learn to generate patches from existing patches and code corpora. While promising, DL-based APR techniques suffer from the abundant syntactically or semantically incorrect patches in the patch space. These patches often disobey the syntactic and semantic domain knowledge of source code and thus cannot be the correct patches to fix a bug. We propose a DL-based APR approach KNOD, which incorporates domain knowledge to guide patch generation in a direct and comprehensive way. KNOD has two major novelties, including (1) a novel three-stage tree decoder, which directly generates Abstract Syntax Trees of patched code according to the inherent tree structure, and (2) a novel domain-rule distillation, which leverages syntactic and semantic rules and teacher-student distributions to explicitly inject the domain knowledge into the decoding procedure during both the training and inference phases. We evaluate KNOD on three widely-used benchmarks. KNOD fixes 72 bugs on the Defects4J v1.2, 25 bugs on the QuixBugs, and 50 bugs on the additional Defects4J v2.0 benchmarks, outperforming all existing APR tools.
SEApr 20Code
EET: Experience-Driven Early Termination for Cost-Efficient Software Engineering AgentsYaoqi Guo, Ying Xiao, Jie M. Zhang et al.
Software engineering (SE) agents powered by large language models are increasingly adopted in practice, yet they often incur substantial monetary cost. We introduce EET, an experience-driven early termination approach that reduces the cost of SE agents while preserving task performance. EET extracts structured experience from prior issue-resolution executions and leverages it to guide early termination during patch generation and selection, reducing unproductive iterations. We evaluate EET on the SWE-bench Verified benchmark across three representative SE agents. EET consistently reduces total cost by 19%-55% (32% on average), with negligible loss in resolution rate (at most 0.2%). These efficiency gains are achieved, on average, by identifying early-termination opportunities for 11% of issues and reducing API calls, input tokens, and output tokens by 21%, 30%, and 25%, respectively. We release the code, prompts, and data at https://github.com/IanWalls/EET.
SESep 30, 2024
Semantic Alignment-Enhanced Code Translation via an LLM-Based Multi-Agent SystemZhiqiang Yuan, Weitong Chen, Hanlin Wang et al.
Code translation converts code from one programming language to another while maintaining its original functionality, which is crucial for software migration, system refactoring, and cross-platform development. Traditional rule-based methods rely on manually-written rules, which can be time-consuming and often result in less readable code. To overcome this, learning-based methods have been developed, leveraging parallel data to train models for automated code translation. More recently, the advance of Large Language Models (LLMs) further boosts learning-based code translation. Although promising, LLM-translated program still suffers from diverse quality issues (e.g., syntax errors and semantic errors). In particular, it can be challenging for LLMs to self-debug these errors when simply provided with the corresponding error messages. In this work, we propose a novel LLM-based multi-agent system TRANSAGENT, which enhances LLM-based code translation by fixing the syntax errors and semantic errors with the synergy between four LLM-based agents, including Initial Code Translator, Syntax Error Fixer, Code Aligner, and Semantic Error Fixer. The main insight of TRANSAGENT is to first localize the error code block in the target program based on the execution alignment between the target and source program, which can narrow down the fixing space and thus lower down the fixing difficulties. To evaluate TRANSAGENT, we first construct a new benchmark from recent programming tasks to mitigate the potential data leakage issue. On our benchmark, TRANSAGENT outperforms the latest LLM-based code translation technique UniTrans in both translation effectiveness and efficiency; additionally, our evaluation on different LLMs show the generalization of TRANSAGENT and our ablation study shows the contribution of each agent.
MANov 26, 2025
BAMAS: Structuring Budget-Aware Multi-Agent SystemsLiming Yang, Junyu Luo, Xuanzhe Liu et al.
Large language model (LLM)-based multi-agent systems have emerged as a powerful paradigm for enabling autonomous agents to solve complex tasks. As these systems scale in complexity, cost becomes an important consideration for practical deployment. However, existing work rarely addresses how to structure multi-agent systems under explicit budget constraints. In this paper, we propose BAMAS, a novel approach for building multi-agent systems with budget awareness. BAMAS first selects an optimal set of LLMs by formulating and solving an Integer Linear Programming problem that balances performance and cost. It then determines how these LLMs should collaborate by leveraging a reinforcement learning-based method to select the interaction topology. Finally, the system is instantiated and executed based on the selected agents and their collaboration topology. We evaluate BAMAS on three representative tasks and compare it with state-of-the-art agent construction methods. Results show that BAMAS achieves comparable performance while reducing cost by up to 86%.
SENov 4, 2025
EvoDev: An Iterative Feature-Driven Framework for End-to-End Software Development with LLM-based AgentsJunwei Liu, Chen Xu, Chong Wang et al.
Recent advances in large language model agents offer the promise of automating end-to-end software development from natural language requirements. However, existing approaches largely adopt linear, waterfall-style pipelines, which oversimplify the iterative nature of real-world development and struggle with complex, large-scale projects. To address these limitations, we propose EvoDev, an iterative software development framework inspired by feature-driven development. EvoDev decomposes user requirements into a set of user-valued features and constructs a Feature Map, a directed acyclic graph that explicitly models dependencies between features. Each node in the feature map maintains multi-level information, including business logic, design, and code, which is propagated along dependencies to provide context for subsequent development iterations. We evaluate EvoDev on challenging Android development tasks and show that it outperforms the best-performing baseline, Claude Code, by a substantial margin of 56.8%, while improving single-agent performance by 16.0%-76.6% across different base LLMs, highlighting the importance of dependency modeling, context propagation, and workflow-aware agent design for complex software projects. Our work summarizes practical insights for designing iterative, LLM-driven development frameworks and informs future training of base LLMs to better support iterative software development.
SEApr 17, 2025Code
Code Copycat Conundrum: Demystifying Repetition in LLM-based Code GenerationMingwei Liu, Juntao Li, Ying Wang et al.
Despite recent advances in Large Language Models (LLMs) for code generation, the quality of LLM-generated code still faces significant challenges. One significant issue is code repetition, which refers to the model's tendency to generate structurally redundant code, resulting in inefficiencies and reduced readability. To address this, we conduct the first empirical study to investigate the prevalence and nature of repetition across 19 state-of-the-art code LLMs using three widely-used benchmarks. Our study includes both quantitative and qualitative analyses, revealing that repetition is pervasive and manifests at various granularities and extents, including character, statement, and block levels. We further summarize a taxonomy of 20 repetition patterns. Building on our findings, we propose DeRep, a rule-based technique designed to detect and mitigate repetition in generated code. We evaluate DeRep using both open-source benchmarks and in an industrial setting. Our results demonstrate that DeRep significantly outperforms baselines in reducing repetition (with an average improvements of 91.3%, 93.5%, and 79.9% in rep-3, rep-line, and sim-line metrics) and enhancing code quality (with a Pass@1 increase of 208.3% over greedy search). Furthermore, integrating DeRep improves the performance of existing repetition mitigation methods, with Pass@1 improvements ranging from 53.7% to 215.7%.
SEJan 26Code
Reducing False Positives in Static Bug Detection with LLMs: An Empirical Study in IndustryXueying Du, Jiayi Feng, Yi Zou et al.
Static analysis tools (SATs) are widely adopted in both academia and industry for improving software quality, yet their practical use is often hindered by high false positive rates, especially in large-scale enterprise systems. These false alarms demand substantial manual inspection, creating severe inefficiencies in industrial code review. While recent work has demonstrated the potential of large language models (LLMs) for false alarm reduction on open-source benchmarks, their effectiveness in real-world enterprise settings remains unclear. To bridge this gap, we conduct the first comprehensive empirical study of diverse LLM-based false alarm reduction techniques in an industrial context at Tencent, one of the largest IT companies in China. Using data from Tencent's enterprise-customized SAT on its large-scale Advertising and Marketing Services software, we construct a dataset of 433 alarms (328 false positives, 105 true positives) covering three common bug types. Through interviewing developers and analyzing the data, our results highlight the prevalence of false positives, which wastes substantial manual effort (e.g., 10-20 minutes of manual inspection per alarm). Meanwhile, our results show the huge potential of LLMs for reducing false alarms in industrial settings (e.g., hybrid techniques of LLM and static analysis eliminate 94-98% of false positives with high recall). Furthermore, LLM-based techniques are cost-effective, with per-alarm costs as low as 2.1-109.5 seconds and $0.0011-$0.12, representing orders-of-magnitude savings compared to manual review. Finally, our case analysis further identifies key limitations of LLM-based false alarm reduction in industrial settings.
AIMay 26, 2025Code
Benchmarking and Enhancing LLM Agents in Localizing Linux Kernel BugsZhenhao Zhou, Zhuochen Huang, Yike He et al.
The Linux kernel is a critical system, serving as the foundation for numerous systems. Bugs in the Linux kernel can cause serious consequences, affecting billions of users. Fault localization (FL), which aims at identifying the buggy code elements in software, plays an essential role in software quality assurance. While recent LLM agents have achieved promising accuracy in FL on recent benchmarks like SWE-bench, it remains unclear how well these methods perform in the Linux kernel, where FL is much more challenging due to the large-scale code base, limited observability, and diverse impact factors. In this paper, we introduce LinuxFLBench, a FL benchmark constructed from real-world Linux kernel bugs. We conduct an empirical study to assess the performance of state-of-the-art LLM agents on the Linux kernel. Our initial results reveal that existing agents struggle with this task, achieving a best top-1 accuracy of only 41.6% at file level. To address this challenge, we propose LinuxFL$^+$, an enhancement framework designed to improve FL effectiveness of LLM agents for the Linux kernel. LinuxFL$^+$ substantially improves the FL accuracy of all studied agents (e.g., 7.2% - 11.2% accuracy increase) with minimal costs. Data and code are available at https://github.com/FudanSELab/LinuxFLBench.
AIMay 27, 2025Code
Can Agents Fix Agent Issues?Alfin Wijaya Rahardja, Junwei Liu, Weitong Chen et al.
LLM-based agent systems are emerging as a new software paradigm and have been widely adopted across diverse domains such as medicine, robotics, and programming. However, maintaining these systems requires substantial effort, as they are inevitably prone to bugs and continually evolve to meet changing external requirements. Therefore, automatically resolving agent issues (i.e., bug reports or feature requests) is a crucial and challenging task. While recent software engineering (SE) agents (e.g., SWE-agent) have shown promise in addressing issues in traditional software systems, it remains unclear how effectively they can resolve real-world issues in agent systems, which differ significantly from traditional software. To fill this gap, we first manually analyze 201 real-world agent issues and identify common categories of agent issues. We then spend 500 person-hours constructing AgentIssue-Bench, a reproducible benchmark comprising 50 agent issue resolution tasks (each with an executable environment and failure-triggering tests). We further evaluate state-of-the-art SE agents on AgentIssue-Bench and reveal their limited effectiveness (i.e., with only 0.67% - 4.67% resolution rates). These results underscore the unique challenges of maintaining agent systems compared to traditional software, highlighting the need for further research to develop advanced SE agents for resolving agent issues. Data and code are available at https://github.com/alfin06/AgentIssue-Bench.
CYNov 1, 2024
Benchmarking Bias in Large Language Models during Role-PlayingXinyue Li, Zhenpeng Chen, Jie M. Zhang et al.
Large Language Models (LLMs) have become foundational in modern language-driven applications, profoundly influencing daily life. A critical technique in leveraging their potential is role-playing, where LLMs simulate diverse roles to enhance their real-world utility. However, while research has highlighted the presence of social biases in LLM outputs, it remains unclear whether and to what extent these biases emerge during role-playing scenarios. In this paper, we introduce BiasLens, a fairness testing framework designed to systematically expose biases in LLMs during role-playing. Our approach uses LLMs to generate 550 social roles across a comprehensive set of 11 demographic attributes, producing 33,000 role-specific questions targeting various forms of bias. These questions, spanning Yes/No, multiple-choice, and open-ended formats, are designed to prompt LLMs to adopt specific roles and respond accordingly. We employ a combination of rule-based and LLM-based strategies to identify biased responses, rigorously validated through human evaluation. Using the generated questions as the benchmark, we conduct extensive evaluations of six advanced LLMs released by OpenAI, Mistral AI, Meta, Alibaba, and DeepSeek. Our benchmark reveals 72,716 biased responses across the studied LLMs, with individual models yielding between 7,754 and 16,963 biased responses, underscoring the prevalence of bias in role-playing contexts. To support future research, we have publicly released the benchmark, along with all scripts and experimental results.
SEDec 16, 2023
Exploring Large Language Models in Resolving Environment-Related Crash Bugs: Localizing and RepairingXueying Du, Mingwei Liu, Hanlin Wang et al.
Software crash bugs cause unexpected program behaviors or even abrupt termination, thus demanding immediate resolution. However, resolving crash bugs can be challenging due to their complex root causes, which can originate from issues in the source code or external factors like third-party library dependencies. Large language models (LLMs) have shown promise in software engineering tasks. However, existing research predominantly focuses on the capability of LLMs to localize and repair code-related crash bugs, leaving their effectiveness in resolving environment-related crash bugs in real-world software unexplored. To fill this gap, we conducted the first comprehensive study to assess the capability of LLMs in resolving real-world environment-related crash bugs. We first systematically compare LLMs' performance in resolving code-related and environment-related crash bugs with varying levels of crash contextual information. Our findings reveal that localization is the primary challenge for resolving code-related crashes, while repair poses a greater challenge for environment-related crashes. Furthermore, we investigate the impact of different prompt strategies on improving the resolution of environment-related crash bugs, incorporating different prompt templates and multi-round interactions. Building on this, we further explore an advanced active inquiry prompting strategy leveraging the self-planning capabilities of LLMs. Based on these explorations, we propose IntDiagSolver, an interactive methodology designed to enable precise crash bug resolution through ongoing engagement with LLMs. Extensive evaluations of IntDiagSolver across multiple LLMs (including GPT-3.5, GPT-4, Claude, CodeLlama, DeepSeek-R1, and Qwen-3-Coder) demonstrate consistent improvements in resolution accuracy, with substantial enhancements ranging from 9.1% to 43.3% in localization and 9.1% to 53.3% in repair.
CLApr 9, 2025
Socrates or Smartypants: Testing Logic Reasoning Capabilities of Large Language Models with Logic Programming-based Test OraclesZihao Xu, Junchen Ding, Yiling Lou et al.
Large Language Models (LLMs) have achieved significant progress in language understanding and reasoning. Evaluating and analyzing their logical reasoning abilities has therefore become essential. However, existing datasets and benchmarks are often limited to overly simplistic, unnatural, or contextually constrained examples. In response to the growing demand, we introduce SmartyPat-Bench, a challenging, naturally expressed, and systematically labeled benchmark derived from real-world high-quality Reddit posts containing subtle logical fallacies. Unlike existing datasets and benchmarks, it provides more detailed annotations of logical fallacies and features more diverse data. To further scale up the study and address the limitations of manual data collection and labeling - such as fallacy-type imbalance and labor-intensive annotation - we introduce SmartyPat, an automated framework powered by logic programming-based oracles. SmartyPat utilizes Prolog rules to systematically generate logically fallacious statements, which are then refined into fluent natural-language sentences by LLMs, ensuring precise fallacy representation. Extensive evaluation demonstrates that SmartyPat produces fallacies comparable in subtlety and quality to human-generated content and significantly outperforms baseline methods. Finally, experiments reveal nuanced insights into LLM capabilities, highlighting that while excessive reasoning steps hinder fallacy detection accuracy, structured reasoning enhances fallacy categorization performance.
SEOct 13, 2025
Project-Level C-to-Rust Translation via Synergistic Integration of Knowledge Graphs and Large Language ModelsZhiqiang Yuan, Wenjun Mao, Zhuo Chen et al.
Translating C code into safe Rust is an effective way to ensure its memory safety. Compared to rule-based translation which produces Rust code that remains largely unsafe, LLM-based methods can generate more idiomatic and safer Rust code because LLMs have been trained on vast amount of human-written idiomatic code. Although promising, existing LLM-based methods still struggle with project-level C-to-Rust translation. They typically partition a C project into smaller units (\eg{} functions) based on call graphs and translate them bottom-up to resolve program dependencies. However, this bottom-up, unit-by-unit paradigm often fails to translate pointers due to the lack of a global perspective on their usage. To address this problem, we propose a novel C-Rust Pointer Knowledge Graph (KG) that enriches a code-dependency graph with two types of pointer semantics: (i) pointer-usage information which record global behaviors such as points-to flows and map lower-level struct usage to higher-level units; and (ii) Rust-oriented annotations which encode ownership, mutability, nullability, and lifetime. Synthesizing the \kg{} with LLMs, we further propose \ourtool{}, which implements a project-level C-to-Rust translation technique. In \ourtool{}, the \kg{} provides LLMs with comprehensive pointer semantics from a global perspective, thus guiding LLMs towards generating safe and idiomatic Rust code from a given C project. Our experiments show that \ourtool{} reduces unsafe usages in translated Rust by 99.9\% compared to both rule-based translation and traditional LLM-based rewriting, while achieving an average 29.3\% higher functional correctness than those fuzzing-enhanced LLM methods.
LGAug 5, 2025
Software Fairness Dilemma: Is Bias Mitigation a Zero-Sum Game?Zhenpeng Chen, Xinyue Li, Jie M. Zhang et al.
Fairness is a critical requirement for Machine Learning (ML) software, driving the development of numerous bias mitigation methods. Previous research has identified a leveling-down effect in bias mitigation for computer vision and natural language processing tasks, where fairness is achieved by lowering performance for all groups without benefiting the unprivileged group. However, it remains unclear whether this effect applies to bias mitigation for tabular data tasks, a key area in fairness research with significant real-world applications. This study evaluates eight bias mitigation methods for tabular data, including both widely used and cutting-edge approaches, across 44 tasks using five real-world datasets and four common ML models. Contrary to earlier findings, our results show that these methods operate in a zero-sum fashion, where improvements for unprivileged groups are related to reduced benefits for traditionally privileged groups. However, previous research indicates that the perception of a zero-sum trade-off might complicate the broader adoption of fairness policies. To explore alternatives, we investigate an approach that applies the state-of-the-art bias mitigation method solely to unprivileged groups, showing potential to enhance benefits of unprivileged groups without negatively affecting privileged groups or overall ML performance. Our study highlights potential pathways for achieving fairness improvements without zero-sum trade-offs, which could help advance the adoption of bias mitigation methods.
SEJun 17, 2024
Vul-RAG: Enhancing LLM-based Vulnerability Detection via Knowledge-level RAGXueying Du, Geng Zheng, Kaixin Wang et al.
Although LLMs have shown promising potential in vulnerability detection, this study reveals their limitations in distinguishing between vulnerable and similar-but-benign patched code (only 0.06 - 0.14 accuracy). It shows that LLMs struggle to capture the root causes of vulnerabilities during vulnerability detection. To address this challenge, we propose enhancing LLMs with multi-dimensional vulnerability knowledge distilled from historical vulnerabilities and fixes. We design a novel knowledge-level Retrieval-Augmented Generation framework Vul-RAG, which improves LLMs with an accuracy increase of 16% - 24% in identifying vulnerable and patched code. Additionally, vulnerability knowledge generated by Vul-RAG can further (1) serve as high-quality explanations to improve manual detection accuracy (from 60% to 77%), and (2) detect 10 previously-unknown bugs in the recent Linux kernel release with 6 assigned CVEs.
SEMay 15, 2021
When Automated Program Repair Meets Regression Testing -- An Extensive Study on 2 Million PatchesYiling Lou, Jun Yang, Samuel Benton et al.
In recent years, Automated Program Repair (APR) has been extensively studied in academia and even drawn wide attention from industry. However, APR techniques can be extremely time consuming since (1) a large number of patches can be generated for a given bug, and (2) each patch needs to be executed on the original tests to ensure its correctness. In the literature, various techniques (e.g., based on learning, mining, and constraint solving) have been proposed/studied to reduce the number of patches. Intuitively, every patch can be treated as a software revision during regression testing; thus, traditional Regression Test Selection (RTS) techniques can be leveraged to only execute the tests affected by each patch (as the other tests would keep the same outcomes) to further reduce patch execution time. However, few APR systems actually adopt RTS and there is still a lack of systematic studies demonstrating the benefits of RTS and the impact of different RTS strategies on APR. To this end, this paper presents the first extensive study of widely-used RTS techniques at different levels (i.e., class/method/statement levels) for 12 state-of-the-art APR systems on over 2M patches. Our study reveals various practical guidelines for bridging the gap between APR and regression testing.
SEJan 13, 2021
An Empirical Study on Deployment Faults of Deep Learning Based Mobile ApplicationsZhenpeng Chen, Huihan Yao, Yiling Lou et al.
Deep Learning (DL) is finding its way into a growing number of mobile software applications. These software applications, named as DL based mobile applications (abbreviated as mobile DL apps) integrate DL models trained using large-scale data with DL programs. A DL program encodes the structure of a desirable DL model and the process by which the model is trained using training data. Due to the increasing dependency of current mobile apps on DL, software engineering (SE) for mobile DL apps has become important. However, existing efforts in SE research community mainly focus on the development of DL models and extensively analyze faults in DL programs. In contrast, faults related to the deployment of DL models on mobile devices (named as deployment faults of mobile DL apps) have not been well studied. Since mobile DL apps have been used by billions of end users daily for various purposes including for safety-critical scenarios, characterizing their deployment faults is of enormous importance. To fill the knowledge gap, this paper presents the first comprehensive study on the deployment faults of mobile DL apps. We identify 304 real deployment faults from Stack Overflow and GitHub, two commonly used data sources for studying software faults. Based on the identified faults, we construct a fine-granularity taxonomy consisting of 23 categories regarding to fault symptoms and distill common fix strategies for different fault types. Furthermore, we suggest actionable implications and research avenues that could further facilitate the deployment of DL models on mobile devices.
SEApr 19, 2020
Interactive Patch Filtering as Debugging AidJingjing Liang, Ruyi Ji, Jiajun Jiang et al.
It is widely recognized that program repair tools need to have a high precision to be useful, i.e., the generated patches need to have a high probability to be correct. However, it is fundamentally difficult to ensure the correctness of the patches, and many tools compromise other aspects of repair performance such as recall for an acceptable precision. In this paper we ask a question: can a repair tool with a low precision be still useful? To explore this question, we propose an interactive filtering approach to patch review, which filters out incorrect patches by asking questions to the developers. Our intuition is that incorrect patches can still help understand the bug. With proper tool support, the benefit outweighs the cost even if there are many incorrect patches. We implemented the approach as an Eclipse plugin tool, InPaFer, and evaluated it with a simulated experiment and a user study with 30 developers. The results show that our approach improve the repair performance of developers, with 62.5% more successfully repaired bugs and 25.3% less debugging time in average. In particular, even if the generated patches are all incorrect, the performance of the developers would not be significantly reduced, and could be improved when some patches provide useful information for repairing, such as the faulty location and a partial fix.
SEOct 3, 2019
Can Automated Program Repair Refine Fault Localization?Yiling Lou, Ali Ghanbari, Xia Li et al.
Software bugs are prevalent in modern software systems and notoriously hard to debug manually. Therefore, a large body of research efforts have been dedicated to automated software debugging, including both automated fault localization and program repair. However, the existing fault localization techniques are usually ineffective on real-world software systems while even the most advanced program repair techniques can only fix a small ratio of real-world bugs. Although fault localization and program repair are inherently connected, we observe that in the literature their only connection is that program repair techniques usually use off-the-shelf fault localization techniques (e.g., Ochiai) to determine the potential candidate statements/elements for patching. In this work, we explore their connection in the other direction, i.e., can program repair in turn help with fault localization? In this way,we not only open a new dimension for more powerful fault localization, but also extend the application scope of program repair to all possible bugs (not only the bugs that can be directly automatically fixed).We have designed ProFL, a simplistic approach using patch-execution results (from program repair) as the feedback information for fault localization. The experimental results on the widely used Defects4J benchmark show that the basic ProFL can already localize 161 of the 395 studied bugs within Top-1, while state-of-the-art spectrum and mutation based fault localization techniques at most localize 117 within Top-1. We also demonstrate ProFL's effectiveness under different settings. Lastly, we show that ProFL can further boost state-of-the-art fault localization via both unsupervised and supervised learning.