LGJul 14, 2022
Bias Mitigation for Machine Learning Classifiers: A Comprehensive SurveyMax Hort, Zhenpeng Chen, Jie M. Zhang et al.
This paper provides a comprehensive survey of bias mitigation methods for achieving fairness in Machine Learning (ML) models. We collect a total of 341 publications concerning bias mitigation for ML classifiers. These methods can be distinguished based on their intervention procedure (i.e., pre-processing, in-processing, post-processing) and the technique they apply. We investigate how existing bias mitigation methods are evaluated in the literature. In particular, we consider datasets, metrics and benchmarking. Based on the gathered insights (e.g., What is the most popular fairness metric? How many datasets are used for evaluating bias mitigation methods?), we hope to support practitioners in making informed choices when developing and evaluating new bias mitigation methods.
SESep 4, 2024Code
Large Language Model-Based Agents for Software Engineering: A SurveyJunwei Liu, Kaixin Wang, Yixuan Chen et al.
The recent advance in Large Language Models (LLMs) has shaped a new paradigm of AI agents, i.e., LLM-based agents. Compared to standalone LLMs, LLM-based agents substantially extend the versatility and expertise of LLMs by enhancing LLMs with the capabilities of perceiving and utilizing external resources and tools. To date, LLM-based agents have been applied and shown remarkable effectiveness in Software Engineering (SE). The synergy between multiple agents and human interaction brings further promise in tackling complex real-world SE problems. In this work, we present a comprehensive and systematic survey on LLM-based agents for SE. We collect 124 papers and categorize them from two perspectives, i.e., the SE and agent perspectives. In addition, we discuss open challenges and future directions in this critical domain. The repository of this survey is at https://github.com/FudanSELab/Agent4SE-Paper-List.
SEJul 7, 2022
A Comprehensive Empirical Study of Bias Mitigation Methods for Machine Learning ClassifiersZhenpeng Chen, Jie M. Zhang, Federica Sarro et al.
Software bias is an increasingly important operational concern for software engineers. We present a large-scale, comprehensive empirical study of 17 representative bias mitigation methods for Machine Learning (ML) classifiers, evaluated with 11 ML performance metrics (e.g., accuracy), 4 fairness metrics, and 20 types of fairness-performance trade-off assessment, applied to 8 widely-adopted software decision tasks. The empirical coverage is much more comprehensive, covering the largest numbers of bias mitigation methods, evaluation metrics, and fairness-performance trade-off measures compared to previous work on this important software property. We find that (1) the bias mitigation methods significantly decrease ML performance in 53% of the studied scenarios (ranging between 42%~66% according to different ML performance metrics); (2) the bias mitigation methods significantly improve fairness measured by the 4 used metrics in 46% of all the scenarios (ranging between 24%~59% according to different fairness metrics); (3) the bias mitigation methods even lead to decrease in both fairness and ML performance in 25% of the scenarios; (4) the effectiveness of the bias mitigation methods depends on tasks, models, the choice of protected attributes, and the set of metrics used to assess fairness and ML performance; (5) there is no bias mitigation method that can achieve the best trade-off in all the scenarios. The best method that we find outperforms other methods in 30% of the scenarios. Researchers and practitioners need to choose the bias mitigation method best suited to their intended application scenario(s).
SEApr 20Code
EET: Experience-Driven Early Termination for Cost-Efficient Software Engineering AgentsYaoqi Guo, Ying Xiao, Jie M. Zhang et al.
Software engineering (SE) agents powered by large language models are increasingly adopted in practice, yet they often incur substantial monetary cost. We introduce EET, an experience-driven early termination approach that reduces the cost of SE agents while preserving task performance. EET extracts structured experience from prior issue-resolution executions and leverages it to guide early termination during patch generation and selection, reducing unproductive iterations. We evaluate EET on the SWE-bench Verified benchmark across three representative SE agents. EET consistently reduces total cost by 19%-55% (32% on average), with negligible loss in resolution rate (at most 0.2%). These efficiency gains are achieved, on average, by identifying early-termination opportunities for 11% of issues and reducing API calls, input tokens, and output tokens by 21%, 30%, and 25%, respectively. We release the code, prompts, and data at https://github.com/IanWalls/EET.
LGJul 25, 2023
Fairness Improvement with Multiple Protected Attributes: How Far Are We?Zhenpeng Chen, Jie M. Zhang, Federica Sarro et al.
Existing research mostly improves the fairness of Machine Learning (ML) software regarding a single protected attribute at a time, but this is unrealistic given that many users have multiple protected attributes. This paper conducts an extensive study of fairness improvement regarding multiple protected attributes, covering 11 state-of-the-art fairness improvement methods. We analyze the effectiveness of these methods with different datasets, metrics, and ML models when considering multiple protected attributes. The results reveal that improving fairness for a single protected attribute can largely decrease fairness regarding unconsidered protected attributes. This decrease is observed in up to 88.3% of scenarios (57.5% on average). More surprisingly, we find little difference in accuracy loss when considering single and multiple protected attributes, indicating that accuracy can be maintained in the multiple-attribute paradigm. However, the effect on F1-score when handling two protected attributes is about twice that of a single attribute. This has important implications for future fairness research: reporting only accuracy as the ML performance metric, which is currently common in the literature, is inadequate.
CYAug 5, 2023
Bias Behind the Wheel: Fairness Testing of Autonomous Driving SystemsXinyue Li, Zhenpeng Chen, Jie M. Zhang et al.
This paper conducts fairness testing of automated pedestrian detection, a crucial but under-explored issue in autonomous driving systems. We evaluate eight state-of-the-art deep learning-based pedestrian detectors across demographic groups on large-scale real-world datasets. To enable thorough fairness testing, we provide extensive annotations for the datasets, resulting in 8,311 images with 16,070 gender labels, 20,115 age labels, and 3,513 skin tone labels. Our findings reveal significant fairness issues, particularly related to age. The proportion of undetected children is 20.14% higher compared to adults. Furthermore, we explore how various driving scenarios affect the fairness of pedestrian detectors. We find that pedestrian detectors demonstrate significant gender biases during night time, potentially exacerbating the prevalent societal issue of female safety concerns during nighttime out. Moreover, we observe that pedestrian detectors can demonstrate both enhanced fairness and superior performance under specific driving conditions, which challenges the fairness-performance trade-off theory widely acknowledged in the fairness literature. We publicly release the code, data, and results to support future research on fairness in autonomous driving.
SEApr 14
LLMs Are Not a Silver Bullet: A Case Study on Software FairnessXinyue Li, Sixuan Li, Ying Xiao et al.
Fairness is a critical requirement for human-related, high-stakes software systems, motivating extensive research on bias mitigation. Prior work has largely focused on tabular data settings using traditional Machine Learning (ML) methods. With the rapid rise of Large Language Models (LLMs), recent studies have begun to explore their use for bias mitigation in the same setting. However, it remains unclear whether LLM-based methods offer advantages over traditional ML methods, leaving software engineers without clear guidance for practical adoption. To address this gap, we present a large-scale study comparing state-of-the-art ML- and LLM-based bias mitigation methods. We find that ML-based methods consistently outperform LLM-based methods in both fairness and predictive performance, with even strong LLMs failing to surpass established ML baselines. To understand why prior LLM-based studies report favorable results, we analyze their evaluation settings and show that these gains are largely driven by artificially balanced test data rather than realistic imbalanced distributions. We further observe that existing LLM-based methods primarily rely on in-context learning and thus fail to leverage all available training data. Motivated by this, we explore supervised fine-tuning on the full training set and find that, while it achieves competitive results, its advantages over traditional ML methods remain limited. These findings suggest that LLMs are not a silver bullet for software fairness.
MANov 26, 2025
BAMAS: Structuring Budget-Aware Multi-Agent SystemsLiming Yang, Junyu Luo, Xuanzhe Liu et al.
Large language model (LLM)-based multi-agent systems have emerged as a powerful paradigm for enabling autonomous agents to solve complex tasks. As these systems scale in complexity, cost becomes an important consideration for practical deployment. However, existing work rarely addresses how to structure multi-agent systems under explicit budget constraints. In this paper, we propose BAMAS, a novel approach for building multi-agent systems with budget awareness. BAMAS first selects an optimal set of LLMs by formulating and solving an Integer Linear Programming problem that balances performance and cost. It then determines how these LLMs should collaborate by leveraging a reinforcement learning-based method to select the interaction topology. Finally, the system is instantiated and executed based on the selected agents and their collaboration topology. We evaluate BAMAS on three representative tasks and compare it with state-of-the-art agent construction methods. Results show that BAMAS achieves comparable performance while reducing cost by up to 86%.
SEApr 16, 2024Code
LLM-Powered Test Case Generation for Detecting Bugs in Plausible ProgramsKaibo Liu, Zhenpeng Chen, Yiyang Liu et al. · pku
Detecting tricky bugs in plausible programs, those that pass existing test suites yet still contain bugs, remains a significant challenge in software testing. To address this problem, we propose TrickCatcher, an LLM-powered approach to generating test cases for uncovering bugs in plausible programs. TrickCatcher operates in three stages: First, it uses an LLM to generate program variants based on the program under test (PUT) and its specification. Second, it employs an LLM to construct an input generator from the specification for producing test inputs. Finally, these inputs are executed on both the PUT and its program variants to detect inconsistencies in their outputs. We evaluate TrickCatcher on two datasets, TrickyBugs and EvalPlus, which include 366 human-written and 151 AI-generated plausible programs with tricky bugs. TrickCatcher achieves recall, precision, and F1 scores that are 1.80x, 2.65x, and 1.66x those of the state-of-the-art baselines, respectively. Code and data used are available at https://github.com/RinCloud/TrickCatcher.
SEOct 16, 2024Code
Personality-Guided Code Generation Using Large Language ModelsYaoqi Guo, Zhenpeng Chen, Jie M. Zhang et al.
Code generation, the automatic creation of source code from natural language descriptions, has garnered significant attention due to its potential to streamline software development. Inspired by research that links task-personality alignment with improved development outcomes, we conduct an empirical study on personality-guided code generation using large language models (LLMs). Specifically, we investigate how emulating personality traits appropriate to the coding tasks affects LLM performance. We extensively evaluate this approach using seven widely adopted LLMs across four representative datasets. Our results show that personality guidance significantly enhances code generation accuracy, with improved pass rates in 23 out of 28 LLM-dataset combinations. Notably, in 11 cases, the improvement exceeds 5%, and in 5 instances, it surpasses 10%, with the highest gain reaching 12.9%. Additionally, personality guidance can be easily integrated with other prompting strategies to further boost performance. We open-source our code and data at https://github.com/IanWalls/Persona-Code.
SEMay 14
Probing Privacy Leaks in LLM-based Code Generation via Test GenerationYifei Ge, Zhenpeng Chen, Weisong Sun et al.
The widespread availability of large-scale code datasets has fueled the rapid development of large language models (LLMs) for code-related tasks. These datasets may include sensitive personally identifiable information (PII), which can lead to privacy leakage when LLMs memorize and reproduce it. However, existing privacy-leakage detection methods rely on ad-hoc prompt construction (manually or automatically designed). Therefore, they do not adequately approximate the real-world contexts in which PII appears in code corpora, making it difficult to extract realistic privacy leakage. In this paper, we propose a pipeline that simulates practical privacy-related code generation scenarios and adopts a test-driven strategy to elicit the memorized information from the generated test cases. We further introduce an automatically constructed privacy feature library that replaces manual prompt engineering by providing realistic templates and examples to guide test case generation. Large-scale experiments on 5 widely used LLMs show that our pipeline exposes more confirmed privacy leakage, achieving a 2.56 times increase in detected leakage compared to existing baselines.
SEFeb 20, 2025Code
Show Me Your Code! Kill Code Poisoning: A Lightweight Method Based on Code NaturalnessWeisong Sun, Yuchen Chen, Mengzhe Yuan et al.
Neural code models (NCMs) have demonstrated extraordinary capabilities in code intelligence tasks. Meanwhile, the security of NCMs and NCMs-based systems has garnered increasing attention. In particular, NCMs are often trained on large-scale data from potentially untrustworthy sources, providing attackers with the opportunity to manipulate them by inserting crafted samples into the data. This type of attack is called a code poisoning attack (also known as a backdoor attack). It allows attackers to implant backdoors in NCMs and thus control model behavior, which poses a significant security threat. However, there is still a lack of effective techniques for detecting various complex code poisoning attacks. In this paper, we propose an innovative and lightweight technique for code poisoning detection named KillBadCode. KillBadCode is designed based on our insight that code poisoning disrupts the naturalness of code. Specifically, KillBadCode first builds a code language model (CodeLM) on a lightweight $n$-gram language model. Then, given poisoned data, KillBadCode utilizes CodeLM to identify those tokens in (poisoned) code snippets that will make the code snippets more natural after being deleted as trigger tokens. Considering that the removal of some normal tokens in a single sample might also enhance code naturalness, leading to a high false positive rate (FPR), we aggregate the cumulative improvement of each token across all samples. Finally, KillBadCode purifies the poisoned data by removing all poisoned samples containing the identified trigger tokens. The experimental results on two code poisoning attacks and four code intelligence tasks demonstrate that KillBadCode significantly outperforms four baselines. More importantly, KillBadCode is very efficient, with a minimum time consumption of only 5 minutes, and is 25 times faster than the best baseline on average.
AIMay 27, 2025Code
Can Agents Fix Agent Issues?Alfin Wijaya Rahardja, Junwei Liu, Weitong Chen et al.
LLM-based agent systems are emerging as a new software paradigm and have been widely adopted across diverse domains such as medicine, robotics, and programming. However, maintaining these systems requires substantial effort, as they are inevitably prone to bugs and continually evolve to meet changing external requirements. Therefore, automatically resolving agent issues (i.e., bug reports or feature requests) is a crucial and challenging task. While recent software engineering (SE) agents (e.g., SWE-agent) have shown promise in addressing issues in traditional software systems, it remains unclear how effectively they can resolve real-world issues in agent systems, which differ significantly from traditional software. To fill this gap, we first manually analyze 201 real-world agent issues and identify common categories of agent issues. We then spend 500 person-hours constructing AgentIssue-Bench, a reproducible benchmark comprising 50 agent issue resolution tasks (each with an executable environment and failure-triggering tests). We further evaluate state-of-the-art SE agents on AgentIssue-Bench and reveal their limited effectiveness (i.e., with only 0.67% - 4.67% resolution rates). These results underscore the unique challenges of maintaining agent systems compared to traditional software, highlighting the need for further research to develop advanced SE agents for resolving agent issues. Data and code are available at https://github.com/alfin06/AgentIssue-Bench.
AIMay 26, 2025Code
AMQA: An Adversarial Dataset for Benchmarking Bias of LLMs in Medicine and HealthcareYing Xiao, Jie Huang, Ruijuan He et al.
Large language models (LLMs) are reaching expert-level accuracy on medical diagnosis questions, yet their mistakes and the biases behind them pose life-critical risks. Bias linked to race, sex, and socioeconomic status is already well known, but a consistent and automatic testbed for measuring it is missing. To fill this gap, this paper presents AMQA -- an Adversarial Medical Question-Answering dataset -- built for automated, large-scale bias evaluation of LLMs in medical QA. AMQA includes 4,806 medical QA pairs sourced from the United States Medical Licensing Examination (USMLE) dataset, generated using a multi-agent framework to create diverse adversarial descriptions and question pairs. Using AMQA, we benchmark five representative LLMs and find surprisingly substantial disparities: even GPT-4.1, the least biased model tested, answers privileged-group questions over 10 percentage points more accurately than unprivileged ones. Compared with the existing benchmark CPV, AMQA reveals 15% larger accuracy gaps on average between privileged and unprivileged groups. Our dataset and code are publicly available at https://github.com/XY-Showing/AMQA to support reproducible research and advance trustworthy, bias-aware medical AI.
SEDec 26, 2023
A Prompt Learning Framework for Source Code SummarizationTingting Xu, Yun Miao, Chunrong Fang et al.
(Source) code summarization is the task of automatically generating natural language summaries (also called comments) for given code snippets. Recently, with the successful application of large language models (LLMs) in numerous fields, software engineering researchers have also attempted to adapt LLMs to solve code summarization tasks. The main adaptation schemes include instruction prompting, task-oriented (full-parameter) fine-tuning, and parameter-efficient fine-tuning (PEFT). However, instruction prompting involves designing crafted prompts and requires users to have professional domain knowledge, while task-oriented fine-tuning requires high training costs, and effective, tailored PEFT methods for code summarization are still lacking. This paper proposes an effective prompt learning framework for code summarization called PromptCS. It no longer requires users to rack their brains to design effective prompts. Instead, PromptCS trains a prompt agent that can generate continuous prompts to unleash the potential for LLMs in code summarization. Compared to the human-written discrete prompt, the continuous prompts are produced under the guidance of LLMs and are therefore easier to understand by LLMs. PromptCS is non-invasive to LLMs and freezes the parameters of LLMs when training the prompt agent, which can greatly reduce the requirements for training resources. Our comprehensive experimental results show that PromptCS significantly outperforms instruction prompting schemes (including zero-shot learning and few-shot learning) on all four widely used metrics, and is comparable to the task-oriented fine-tuning scheme. In some base LLMs, e.g., StarCoderBase-1B and -3B, PromptCS even outperforms the task-oriented fine-tuning scheme. More importantly, the training efficiency of PromptCS is faster than the task-oriented fine-tuning scheme, with a more pronounced advantage on larger LLMs.
CYNov 1, 2024
Benchmarking Bias in Large Language Models during Role-PlayingXinyue Li, Zhenpeng Chen, Jie M. Zhang et al.
Large Language Models (LLMs) have become foundational in modern language-driven applications, profoundly influencing daily life. A critical technique in leveraging their potential is role-playing, where LLMs simulate diverse roles to enhance their real-world utility. However, while research has highlighted the presence of social biases in LLM outputs, it remains unclear whether and to what extent these biases emerge during role-playing scenarios. In this paper, we introduce BiasLens, a fairness testing framework designed to systematically expose biases in LLMs during role-playing. Our approach uses LLMs to generate 550 social roles across a comprehensive set of 11 demographic attributes, producing 33,000 role-specific questions targeting various forms of bias. These questions, spanning Yes/No, multiple-choice, and open-ended formats, are designed to prompt LLMs to adopt specific roles and respond accordingly. We employ a combination of rule-based and LLM-based strategies to identify biased responses, rigorously validated through human evaluation. Using the generated questions as the benchmark, we conduct extensive evaluations of six advanced LLMs released by OpenAI, Mistral AI, Meta, Alibaba, and DeepSeek. Our benchmark reveals 72,716 biased responses across the studied LLMs, with individual models yielding between 7,754 and 16,963 biased responses, underscoring the prevalence of bias in role-playing contexts. To support future research, we have publicly released the benchmark, along with all scripts and experimental results.
LGDec 11, 2024
Diversity Drives Fairness: Ensemble of Higher Order Mutants for Intersectional Fairness of Machine Learning SoftwareZhenpeng Chen, Xinyue Li, Jie M. Zhang et al.
Intersectional fairness is a critical requirement for Machine Learning (ML) software, demanding fairness across subgroups defined by multiple protected attributes. This paper introduces FairHOME, a novel ensemble approach using higher order mutation of inputs to enhance intersectional fairness of ML software during the inference phase. Inspired by social science theories highlighting the benefits of diversity, FairHOME generates mutants representing diverse subgroups for each input instance, thus broadening the array of perspectives to foster a fairer decision-making process. Unlike conventional ensemble methods that combine predictions made by different models, FairHOME combines predictions for the original input and its mutants, all generated by the same ML model, to reach a final decision. Notably, FairHOME is even applicable to deployed ML software as it bypasses the need for training new models. We extensively evaluate FairHOME against seven state-of-the-art fairness improvement methods across 24 decision-making tasks using widely adopted metrics. FairHOME consistently outperforms existing methods across all metrics considered. On average, it enhances intersectional fairness by 47.5%, surpassing the currently best-performing method by 9.6 percentage points.
LGFeb 8, 2024
Anatomizing Deep Learning Inference in Web BrowsersQipeng Wang, Shiqi Jiang, Zhenpeng Chen et al.
Web applications have increasingly adopted Deep Learning (DL) through in-browser inference, wherein DL inference performs directly within Web browsers. The actual performance of in-browser inference and its impacts on the quality of experience (QoE) remain unexplored, and urgently require new QoE measurements beyond traditional ones, e.g., mainly focusing on page load time. To bridge this gap, we make the first comprehensive performance measurement of in-browser inference to date. Our approach proposes new metrics to measure in-browser inference: responsiveness, smoothness, and inference accuracy. Our extensive analysis involves 9 representative DL models across Web browsers of 50 popular PC devices and 20 mobile devices. The results reveal that in-browser inference exhibits a substantial latency gap, averaging 16.9 times slower on CPU and 4.9 times slower on GPU compared to native inference on PC devices. The gap on mobile CPU and mobile GPU is 15.8 times and 7.8 times, respectively. Furthermore, we identify contributing factors to such latency gap, including underutilized hardware instruction sets, inherent overhead in the runtime environment, resource contention within the browser, and inefficiencies in software libraries and GPU abstractions. Additionally, in-browser inference imposes significant memory demands, at times exceeding 334.6 times the size of the DL models themselves, partly attributable to suboptimal memory management. We also observe that in-browser inference leads to a significant 67.2% increase in the time it takes for GUI components to render within Web browsers, significantly affecting the overall user QoE of Web applications reliant on this technology
LGAug 5, 2025
Software Fairness Dilemma: Is Bias Mitigation a Zero-Sum Game?Zhenpeng Chen, Xinyue Li, Jie M. Zhang et al.
Fairness is a critical requirement for Machine Learning (ML) software, driving the development of numerous bias mitigation methods. Previous research has identified a leveling-down effect in bias mitigation for computer vision and natural language processing tasks, where fairness is achieved by lowering performance for all groups without benefiting the unprivileged group. However, it remains unclear whether this effect applies to bias mitigation for tabular data tasks, a key area in fairness research with significant real-world applications. This study evaluates eight bias mitigation methods for tabular data, including both widely used and cutting-edge approaches, across 44 tasks using five real-world datasets and four common ML models. Contrary to earlier findings, our results show that these methods operate in a zero-sum fashion, where improvements for unprivileged groups are related to reduced benefits for traditionally privileged groups. However, previous research indicates that the perception of a zero-sum trade-off might complicate the broader adoption of fairness policies. To explore alternatives, we investigate an approach that applies the state-of-the-art bias mitigation method solely to unprivileged groups, showing potential to enhance benefits of unprivileged groups without negatively affecting privileged groups or overall ML performance. Our study highlights potential pathways for achieving fairness improvements without zero-sum trade-offs, which could help advance the adoption of bias mitigation methods.
SEDec 12, 2021
Rise of Distributed Deep Learning Training in the Big Model Era: From a Software Engineering PerspectiveXuanzhe Liu, Diandian Gu, Zhenpeng Chen et al.
Deep learning (DL) has become a key component of modern software. In the "big model" era, the rich features of DL-based software substantially rely on powerful DL models, e.g., BERT, GPT-3, and the recently emerging GPT-4, which are trained on the powerful cloud with large datasets. Hence, training effective DL models has become a vital stage in the whole software lifecycle. When training deep learning models, especially those big models, developers need to parallelize and distribute the computation and memory resources amongst multiple devices in the training process, which is known as distributed deep learning training, or distributed training for short. However, the unique challenges that developers encounter in distributed training process have not been studied in the software engineering community. Given the increasingly heavy dependence of current DL-based software on distributed training, this paper aims to fill in the knowledge gap and presents the first comprehensive study on developers' issues in distributed training. To this end, we analyze 1,131 real-world developers' issues about using these frameworks reported on Stack Overflow and GitHub. We construct a fine-grained taxonomy consisting of 30 categories regarding the fault symptoms and summarize common fix patterns for different symptoms. Based on the results, we suggest actionable implications on research avenues that can potentially facilitate the distributed training to develop DL-based software, such as focusing on the frequent and common fix patterns when designing testing or debugging tools, developing efficient testing and debugging techniques for communication configuration along with the synthesis of network configuration analysis, designing new multi-device checkpoint-and-replay techniques to help reproduction, and designing serverless APIs for cloud platforms.
CVJul 19, 2021
Learning point embedding for 3D data processingZhenpeng Chen, Yuan li
Among 2D convolutional networks on point clouds, point-based approaches consume point clouds of fixed size directly. By analysis of PointNet, a pioneer in introducing deep learning into point sets, we reveal that current point-based methods are essentially spatial relationship processing networks. In this paper, we take a different approach. Our architecture, named PE-Net, learns the representation of point clouds in high-dimensional space, and encodes the unordered input points to feature vectors, which standard 2D CNNs can be applied to. The recommended network can adapt to changes in the number of input points which is the limit of current methods. Experiments show that in the tasks of classification and part segmentation, PE-Net achieves the state-of-the-art performance in multiple challenging datasets, such as ModelNet and ShapeNetPart.
SEMay 15, 2021
When Automated Program Repair Meets Regression Testing -- An Extensive Study on 2 Million PatchesYiling Lou, Jun Yang, Samuel Benton et al.
In recent years, Automated Program Repair (APR) has been extensively studied in academia and even drawn wide attention from industry. However, APR techniques can be extremely time consuming since (1) a large number of patches can be generated for a given bug, and (2) each patch needs to be executed on the original tests to ensure its correctness. In the literature, various techniques (e.g., based on learning, mining, and constraint solving) have been proposed/studied to reduce the number of patches. Intuitively, every patch can be treated as a software revision during regression testing; thus, traditional Regression Test Selection (RTS) techniques can be leveraged to only execute the tests affected by each patch (as the other tests would keep the same outcomes) to further reduce patch execution time. However, few APR systems actually adopt RTS and there is still a lack of systematic studies demonstrating the benefits of RTS and the impact of different RTS strategies on APR. To this end, this paper presents the first extensive study of widely-used RTS techniques at different levels (i.e., class/method/statement levels) for 12 state-of-the-art APR systems on over 2M patches. Our study reveals various practical guidelines for bridging the gap between APR and regression testing.
LGFeb 10, 2021
Emojis predict dropouts of remote workers: An empirical study of emoji usage on GitHubXuan Lu, Wei Ai, Zhenpeng Chen et al.
Emotions at work have long been identified as critical signals of work motivations, status, and attitudes, and as predictors of various work-related outcomes. When more and more employees work remotely, these emotional signals of workers become harder to observe through daily, face-to-face communications. The use of online platforms to communicate and collaborate at work provides an alternative channel to monitor the emotions of workers. This paper studies how emojis, as non-verbal cues in online communications, can be used for such purposes and how the emotional signals in emoji usage can be used to predict future behavior of workers. In particular, we present how the developers on GitHub use emojis in their work-related activities. We show that developers have diverse patterns of emoji usage, which can be related to their working status including activity levels, types of work, types of communications, time management, and other behavioral patterns. Developers who use emojis in their posts are significantly less likely to dropout from the online work platform. Surprisingly, solely using emoji usage as features, standard machine learning models can predict future dropouts of developers at a satisfactory accuracy. Features related to the general use and the emotions of emojis appear to be important factors, while they do not rule out paths through other purposes of emoji use.
SEJan 13, 2021
An Empirical Study on Deployment Faults of Deep Learning Based Mobile ApplicationsZhenpeng Chen, Huihan Yao, Yiling Lou et al.
Deep Learning (DL) is finding its way into a growing number of mobile software applications. These software applications, named as DL based mobile applications (abbreviated as mobile DL apps) integrate DL models trained using large-scale data with DL programs. A DL program encodes the structure of a desirable DL model and the process by which the model is trained using training data. Due to the increasing dependency of current mobile apps on DL, software engineering (SE) for mobile DL apps has become important. However, existing efforts in SE research community mainly focus on the development of DL models and extensively analyze faults in DL programs. In contrast, faults related to the deployment of DL models on mobile devices (named as deployment faults of mobile DL apps) have not been well studied. Since mobile DL apps have been used by billions of end users daily for various purposes including for safety-critical scenarios, characterizing their deployment faults is of enormous importance. To fill the knowledge gap, this paper presents the first comprehensive study on the deployment faults of mobile DL apps. We identify 304 real deployment faults from Stack Overflow and GitHub, two commonly used data sources for studying software faults. Based on the identified faults, we construct a fine-granularity taxonomy consisting of 23 categories regarding to fault symptoms and distill common fix strategies for different fault types. Furthermore, we suggest actionable implications and research avenues that could further facilitate the deployment of DL models on mobile devices.
SEDec 2, 2020
Characterizing Commodity Serverless Computing PlatformsJinfeng Wen, Yi Liu, Zhenpeng Chen et al.
Serverless computing has become a new trending paradigm in cloud computing, allowing developers to focus on the development of core application logic and rapidly construct the prototype via the composition of independent functions. With the development and prosperity of serverless computing, major cloud vendors have successively rolled out their commodity serverless computing platforms. However, the characteristics of these platforms have not been systematically studied. Measuring these characteristics can help developers to select the most adequate serverless computing platform and develop their serverless-based applications in the right way. To fill this knowledge gap, we present a comprehensive study on characterizing mainstream commodity serverless computing platforms, including AWS Lambda, Google Cloud Functions, Azure Functions, and Alibaba Cloud Function Compute. Specifically, we conduct both qualitative analysis and quantitative analysis. In qualitative analysis, we compare these platforms from three aspects (i.e., development, deployment, and runtime) based on their official documentation to construct a taxonomy of characteristics. In quantitative analysis, we analyze the runtime performance of these platforms from multiple dimensions with well-designed benchmarks. First, we analyze three key factors that can influence the startup latency of serverless-based applications. Second, we compare the resource efficiency of different platforms with 16 representative benchmarks. Finally, we measure their performance difference when dealing with different concurrent requests, and explore the potential causes in a black-box fashion. Based on the results of both qualitative and quantitative analysis, we derive a series of findings and provide insightful implications for both developers and cloud vendors.
LGJun 12, 2020
Characterizing Impacts of Heterogeneity in Federated Learning upon Large-Scale Smartphone DataChengxu Yang, Qipeng Wang, Mengwei Xu et al.
Federated learning (FL) is an emerging, privacy-preserving machine learning paradigm, drawing tremendous attention in both academia and industry. A unique characteristic of FL is heterogeneity, which resides in the various hardware specifications and dynamic states across the participating devices. Theoretically, heterogeneity can exert a huge influence on the FL training process, e.g., causing a device unavailable for training or unable to upload its model updates. Unfortunately, these impacts have never been systematically studied and quantified in existing FL literature. In this paper, we carry out the first empirical study to characterize the impacts of heterogeneity in FL. We collect large-scale data from 136k smartphones that can faithfully reflect heterogeneity in real-world settings. We also build a heterogeneity-aware FL platform that complies with the standard FL protocol but with heterogeneity in consideration. Based on the data and the platform, we conduct extensive experiments to compare the performance of state-of-the-art FL algorithms under heterogeneity-aware and heterogeneity-unaware settings. Results show that heterogeneity causes non-trivial performance degradation in FL, including up to 9.2% accuracy drop, 2.32x lengthened training time, and undermined fairness. Furthermore, we analyze potential impact factors and find that device failure and participant bias are two potential factors for performance degradation. Our study provides insightful implications for FL practitioners. On the one hand, our findings suggest that FL algorithm designers consider necessary heterogeneity during the evaluation. On the other hand, our findings urge system providers to design specific mechanisms to mitigate the impacts of heterogeneity.
SEMay 2, 2020
A Comprehensive Study on Challenges in Deploying Deep Learning Based SoftwareZhenpeng Chen, Yanbin Cao, Yuanqiang Liu et al.
Deep learning (DL) becomes increasingly pervasive, being used in a wide range of software applications. These software applications, named as DL based software (in short as DL software), integrate DL models trained using a large data corpus with DL programs written based on DL frameworks such as TensorFlow and Keras. A DL program encodes the network structure of a desirable DL model and the process by which the model is trained using the training data. To help developers of DL software meet the new challenges posed by DL, enormous research efforts in software engineering have been devoted. Existing studies focus on the development of DL software and extensively analyze faults in DL programs. However, the deployment of DL software has not been comprehensively studied. To fill this knowledge gap, this paper presents a comprehensive study on understanding challenges in deploying DL software. We mine and analyze 3,023 relevant posts from Stack Overflow, a popular Q&A website for developers, and show the increasing popularity and high difficulty of DL software deployment among developers. We build a taxonomy of specific challenges encountered by developers in the process of DL software deployment through manual inspection of 769 sampled posts and report a series of actionable implications for researchers, developers, and DL framework vendors.
SEJul 4, 2019
SEntiMoji: An Emoji-Powered Learning Approach for Sentiment Analysis in Software EngineeringZhenpeng Chen, Yanbin Cao, Xuan Lu et al.
Sentiment analysis has various application scenarios in software engineering (SE), such as detecting developers' emotions in commit messages and identifying their opinions on Q&A forums. However, commonly used out-of-the-box sentiment analysis tools cannot obtain reliable results on SE tasks and the misunderstanding of technical jargon is demonstrated to be the main reason. Then, researchers have to utilize labeled SE-related texts to customize sentiment analysis for SE tasks via a variety of algorithms. However, the scarce labeled data can cover only very limited expressions and thus cannot guarantee the analysis quality. To address such a problem, we turn to the easily available emoji usage data for help. More specifically, we employ emotional emojis as noisy labels of sentiments and propose a representation learning approach that uses both Tweets and GitHub posts containing emojis to learn sentiment-aware representations for SE-related texts. These emoji-labeled posts can not only supply the technical jargon, but also incorporate more general sentiment patterns shared across domains. They as well as labeled data are used to learn the final sentiment classifier. Compared to the existing sentiment analysis methods used in SE, the proposed approach can achieve significant improvement on representative benchmark datasets. By further contrast experiments, we find that the Tweets make a key contribution to the power of our approach. This finding informs future research not to unilaterally pursue the domain-specific resource, but try to transform knowledge from the open domain through ubiquitous signals such as emojis.
CYDec 12, 2018
A First Look at Emoji Usage on GitHub: An Empirical StudyXuan Lu, Yanbin Cao, Zhenpeng Chen et al.
Emoji is becoming a ubiquitous language and gaining worldwide popularity in recent years including the field of software engineering (SE). As nonverbal cues, emojis are widely used in user understanding tasks such as sentiment analysis, but few work has been done to study emojis in SE scenarios. This paper presents a large scale empirical study on how GitHub users use emojis in development-related communications. We find that emojis are used by a considerable proportion of GitHub users. In comparison to Internet users, developers show interesting usage characteristics and have their own interpretation of the meanings of emojis. In addition, the usage of emojis reflects a positive and supportive culture of this community. Through a manual annotation task, we find that sentimental usage is a main intention of using emojis in issues, pull requests, and comments, while emojis are mainly used to emphasize important contents in README. These findings not only deepen our understanding about the culture of SE communities, but also provide implications on how to facilitate SE tasks with emojis such as sentiment analysis.
IRJun 7, 2018
Emoji-Powered Representation Learning for Cross-Lingual Sentiment ClassificationZhenpeng Chen, Sheng Shen, Ziniu Hu et al.
Sentiment classification typically relies on a large amount of labeled data. In practice, the availability of labels is highly imbalanced among different languages, e.g., more English texts are labeled than texts in any other languages, which creates a considerable inequality in the quality of related information services received by users speaking different languages. To tackle this problem, cross-lingual sentiment classification approaches aim to transfer knowledge learned from one language that has abundant labeled examples (i.e., the source language, usually English) to another language with fewer labels (i.e., the target language). The source and the target languages are usually bridged through off-the-shelf machine translation tools. Through such a channel, cross-language sentiment patterns can be successfully learned from English and transferred into the target languages. This approach, however, often fails to capture sentiment knowledge specific to the target language, and thus compromises the accuracy of the downstream classification task. In this paper, we employ emojis, which are widely available in many languages, as a new channel to learn both the cross-language and the language-specific sentiment patterns. We propose a novel representation learning method that uses emoji prediction as an instrument to learn respective sentiment-aware representations for each language. The learned representations are then integrated to facilitate cross-lingual sentiment classification. The proposed method demonstrates state-of-the-art performance on benchmark datasets, which is sustained even when sentiment labels are scarce.
HCMay 16, 2017
Through a Gender Lens: Learning Usage Patterns of Emojis from Large-Scale Android UsersZhenpeng Chen, Xuan Lu, Wei Ai et al.
Based on a large data set of emoji using behavior collected from smartphone users over the world, this paper investigates gender-specific usage of emojis. We present various interesting findings that evidence a considerable difference in emoji usage by female and male users. Such a difference is significant not just in a statistical sense; it is sufficient for a machine learning algorithm to accurately infer the gender of a user purely based on the emojis used in their messages. In real world scenarios where gender inference is a necessity, models based on emojis have unique advantages over existing models that are based on textual or contextual information. Emojis not only provide language-independent indicators, but also alleviate the risk of leaking private user information through the analysis of text and metadata.