Kewen Peng

SE
h-index: 7
9 papers
118 citations
Novelty: 43%
AI Score: 30

9 Papers

LG · Apr 17, 2025 · Code
Software Engineering Principles for Fairer Systems: Experiments with GroupCART

Kewen Peng, Hao Zhuo, Yicheng Yang et al.

Discrimination-aware classification aims to make accurate predictions while satisfying fairness constraints. Traditional decision tree learners typically optimize for information gain in the target attribute alone, which can result in models that unfairly discriminate against protected social groups (e.g., gender, ethnicity). Motivated by these shortcomings, we propose GroupCART, a tree-based ensemble optimizer that avoids bias during model construction by optimizing not only for decreased entropy in the target attribute but also for increased entropy in protected attributes. Our experiments show that GroupCART achieves fairer models without data transformation and with minimal performance degradation. Furthermore, the method supports customizable weighting, offering a smooth and flexible trade-off between predictive performance and fairness based on user requirements. These results demonstrate that algorithmic bias in decision tree models can be mitigated through multi-task, fairness-aware learning. All code and datasets used in this study are available at: https://github.com/anonymous12138/groupCART.
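
The abstract states the split criterion only in words: reward lower entropy in the target and higher entropy in the protected attribute. Below is a minimal sketch of one way to score a candidate split under that idea. It assumes a simple weighted sum of the two terms. The function names, the dictionary row format, and the weight `w` are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def child_entropy(left, right):
    """Size-weighted average entropy of the two child nodes."""
    n = len(left) + len(right)
    return (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)


def split_score(rows, feature, threshold, target="y", protected="s", w=0.5):
    """Score a binary split; higher is better.

    Assumed objective (hypothetical): (1 - w) * information gain on the target
    plus w * entropy of the protected attribute inside the children.
    """
    left = [r for r in rows if r[feature] <= threshold]
    right = [r for r in rows if r[feature] > threshold]
    if not left or not right:
        return float("-inf")  # degenerate split, never preferred
    gain = entropy([r[target] for r in rows]) - child_entropy(
        [r[target] for r in left], [r[target] for r in right]
    )
    mixing = child_entropy(
        [r[protected] for r in left], [r[protected] for r in right]
    )
    return (1 - w) * gain + w * mixing


# Toy data: feature "a" predicts the target, feature "g" mirrors the protected attribute.
rows = [
    {"a": 0, "g": 0, "y": 0, "s": 0},
    {"a": 0, "g": 1, "y": 0, "s": 1},
    {"a": 1, "g": 0, "y": 1, "s": 0},
    {"a": 1, "g": 1, "y": 1, "s": 1},
]
score_a = split_score(rows, "a", 0.5)
score_g = split_score(rows, "g", 0.5)
```

On this toy data the split on `a` scores higher than the split on `g`, because `g` separates the protected groups without separating the target. Setting `w=0` reduces the score to plain information gain.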

LG · Apr 23, 2025
Whence Is A Model Fair? Fixing Fairness Bugs via Propensity Score Matching

Kewen Peng, Yicheng Yang, Hao Zhuo

Fairness-aware learning aims to mitigate discrimination against specific protected social groups (e.g., those categorized by gender, ethnicity, age) while minimizing predictive performance loss. Despite efforts to improve fairness in machine learning, prior studies have shown that many models remain unfair when measured against various fairness metrics. In this paper, we examine whether the way training and testing data are sampled affects the reliability of reported fairness metrics. Since training and test sets are often randomly sampled from the same population, bias present in the training data may still exist in the test data, potentially skewing fairness assessments. To address this, we propose FairMatch, a post-processing method that applies propensity score matching to evaluate and mitigate bias. FairMatch identifies control and treatment pairs with similar propensity scores in the test set and adjusts decision thresholds for different subgroups accordingly. For samples that cannot be matched, we perform probabilistic calibration using fairness-aware loss functions. Experimental results demonstrate that our approach can (a) precisely locate subsets of the test data where the model is unbiased, and (b) significantly reduce bias on the remaining data. Overall, propensity score matching offers a principled way to improve both fairness evaluation and mitigation, without sacrificing predictive performance.
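
The matching step described above can be illustrated with a small sketch. It assumes that propensity scores (the estimated probability of belonging to the protected group given the other features) have already been computed by some model. Greedy nearest-neighbour matching with a caliper is used here as one common matching strategy. The abstract does not say which strategy FairMatch uses, so the function name, the `caliper` parameter, and the greedy order are assumptions.

```python
def match_by_propensity(scores, groups, caliper=0.05):
    """Pair each treated sample (group 1) with the closest unused control (group 0).

    scores  -- propensity score per sample, assumed precomputed
    groups  -- 1 for treatment, 0 for control
    caliper -- maximum allowed score difference for a valid pair (hypothetical default)
    Returns (pairs, unmatched) where pairs is a list of (treated_idx, control_idx).
    """
    treated = [i for i, g in enumerate(groups) if g == 1]
    controls = {i for i, g in enumerate(groups) if g == 0}
    pairs = []
    for t in sorted(treated, key=lambda i: scores[i]):
        if not controls:
            break
        # Closest remaining control; ties are broken by the lower index.
        best = min(controls, key=lambda c: (abs(scores[c] - scores[t]), c))
        if abs(scores[best] - scores[t]) <= caliper:
            pairs.append((t, best))
            controls.remove(best)
    matched = {i for pair in pairs for i in pair}
    unmatched = [i for i in range(len(scores)) if i not in matched]
    return pairs, unmatched


scores = [0.30, 0.32, 0.80, 0.55, 0.31]
groups = [1, 0, 1, 0, 0]
pairs, unmatched = match_by_propensity(scores, groups)
```

In this sketch the matched pairs would feed the subgroup threshold adjustment, and the `unmatched` indices would go to the calibration step that the abstract mentions for samples that cannot be matched.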

LG · Apr 21, 2025
Combating Toxic Language: A Review of LLM-Based Strategies for Software Engineering

Hao Zhuo, Yicheng Yang, Kewen Peng

Large Language Models (LLMs) have become integral to software engineering (SE), where they are increasingly used in development workflows. However, their widespread use raises concerns about the presence and propagation of toxic language--harmful or offensive content that can foster exclusionary environments. This paper provides a comprehensive review of recent research on toxicity detection and mitigation, focusing on both SE-specific and general-purpose datasets. We examine annotation and preprocessing techniques, assess detection methodologies, and evaluate mitigation strategies, particularly those leveraging LLMs. Additionally, we conduct an ablation study demonstrating the effectiveness of LLM-based rewriting for reducing toxicity. By synthesizing existing work and identifying open challenges, this review highlights key areas for future research to ensure the responsible deployment of LLMs in SE and beyond.

LG · Oct 3, 2021
FairMask: Better Fairness via Model-based Rebalancing of Protected Attributes

Kewen Peng, Joymallya Chakraborty, Tim Menzies

Context: Machine learning software can generate models that inappropriately discriminate against specific protected social groups (e.g., groups based on gender, ethnicity, etc.). Motivated by those results, software engineering researchers have proposed many methods for mitigating those discriminatory effects. While those methods are effective in mitigating bias, few of them can explain the root cause of that bias. Objective: We aim at better detection and mitigation of algorithmic discrimination in machine learning software. Method: Here we propose xFAIR, a model-based extrapolation method that is capable of both mitigating bias and explaining its cause. In our xFAIR approach, protected attributes are represented by models learned from the other independent variables (and these models offer extrapolations over the space between existing examples). We then use the extrapolation models to relabel protected attributes later seen in testing data or at deployment time. Our approach aims to offset the biased predictions of the classification model by rebalancing the distribution of protected attributes. Results: The experiments of this paper show that, without compromising (original) model performance, xFAIR can achieve significantly better group and individual fairness (as measured by different metrics) than benchmark methods. Moreover, when compared to another instance-based rebalancing method, our model-based approach shows faster runtime and thus better scalability. Conclusion: Algorithmic decision bias can be removed via extrapolation that smooths away outlier points. As evidence for this, our proposed xFAIR performs better (as measured by fairness and performance metrics) than two state-of-the-art fairness algorithms.
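
The relabeling idea in the Method paragraph can be sketched as follows: learn a model of the protected attribute from the other independent variables, then overwrite the protected attribute in test rows with that model's prediction before the classifier sees them. The abstract does not name the learner used for the extrapolation model, so a nearest-centroid model stands in for it here purely for illustration. All function and field names are hypothetical.

```python
def fit_centroids(rows, protected, features):
    """Stand-in extrapolation model: mean feature vector per protected value."""
    sums, counts = {}, {}
    for r in rows:
        key = r[protected]
        acc = sums.setdefault(key, [0.0] * len(features))
        for j, f in enumerate(features):
            acc[j] += r[f]
        counts[key] = counts.get(key, 0) + 1
    return {k: [v / counts[k] for v in acc] for k, acc in sums.items()}


def predict_protected(centroids, row, features):
    """Predict the protected value whose centroid is closest to the row."""
    def dist(center):
        return sum((row[f] - center[j]) ** 2 for j, f in enumerate(features))
    return min(sorted(centroids), key=lambda k: dist(centroids[k]))


def relabel(rows, centroids, protected, features):
    """Return copies of rows with the protected attribute replaced by the prediction."""
    return [
        {**r, protected: predict_protected(centroids, r, features)} for r in rows
    ]


train = [
    {"income": 10, "hours": 20, "sex": 0},
    {"income": 12, "hours": 22, "sex": 0},
    {"income": 50, "hours": 40, "sex": 1},
    {"income": 52, "hours": 42, "sex": 1},
]
features = ["income", "hours"]
model = fit_centroids(train, "sex", features)
test_rows = [{"income": 48, "hours": 39, "sex": 0}]
relabeled = relabel(test_rows, model, "sex", features)
```

The relabeled rows, not the originals, would then be passed to the already trained classifier. The original test rows are left unmodified so both versions can be compared.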

SE · Jul 11, 2021
Fairer Software Made Easier (using "Keys")

Tim Menzies, Kewen Peng, Andre Lustosa

Can we simplify explanations for software analytics? Maybe. Recent results show that systems often exhibit a "keys effect"; i.e., a few key features control the rest. To state the obvious, for systems controlled by a few keys, explanation and control are just a matter of running a handful of "what-if" queries across the keys. By exploiting the keys effect, it should be possible to dramatically simplify even complex explanations, such as those required for ethical AI systems.

SE · Jun 4, 2021
VEER: Enhancing the Interpretability of Model-based Optimizations

Kewen Peng, Christian Kaltenecker, Norbert Siegmund et al.

Many software systems can be tuned for multiple objectives (e.g., faster runtime, less required memory, less network traffic or energy consumption, etc.). Optimizers built for different objectives suffer from "model disagreement"; i.e., they have different (or even opposite) insights and tactics on how to optimize a system. Model disagreement is rampant (at least for configuration problems). Yet prior to this paper, it has barely been explored. This paper shows that model disagreement can be mitigated via VEER, a one-dimensional approximation to the N-objective space. Since it is exploring a simpler goal space, VEER runs very fast (for eleven configuration problems). Even for our largest problem (with tens of thousands of possible configurations), VEER finds as good or better optimizations with zero model disagreements, three orders of magnitude faster (since its one-dimensional output no longer needs the sorting procedure). Based on the above, we recommend VEER as a very fast method to solve complex configuration problems, while at the same time avoiding model disagreement.

SE · Jul 6, 2020
Making Fair ML Software using Trustworthy Explanation

Joymallya Chakraborty, Kewen Peng, Tim Menzies

Machine learning software is being used in many applications (finance, hiring, admissions, criminal justice) with a huge social impact. But sometimes the behavior of this software is biased, and it discriminates based on sensitive attributes such as sex, race, etc. Prior work concentrated on finding and mitigating bias in ML models. A recent trend is using instance-based, model-agnostic explanation methods such as LIME to detect bias in model predictions. Our work concentrates on finding shortcomings of current bias measures and explanation methods. We show how our proposed method, based on K nearest neighbors, can overcome those shortcomings and find the underlying bias of black-box models. Our results are more trustworthy and helpful for practitioners. Finally, we describe our future framework combining explanation and planning to build fair software.

SE · Jun 12, 2020
Defect Reduction Planning (using TimeLIME)

Kewen Peng, Tim Menzies

Software comes in releases. An implausible change to software is something that has never been changed in prior releases. When planning how to reduce defects, it is better to use plausible changes, i.e., changes with some precedent in the prior releases. To demonstrate these points, this paper compares several defect reduction planning tools. LIME is a local sensitivity analysis tool that can report the fewest changes needed to alter the classification of some code module (e.g., from "defective" to "non-defective"). TimeLIME is a new tool, introduced in this paper, that improves LIME by restricting its plans to just those attributes which change the most within a project. In this study, we compared the performance of LIME, TimeLIME, and several other defect reduction planning algorithms. The generated plans were assessed via (a) the similarity scores between the proposed code changes and the real code changes made by developers; and (b) the improvement scores seen within projects that followed the plans. Across nine project trials, TimeLIME outperformed all other algorithms in 8 of the 9. Hence, we strongly recommend using past releases as a source of knowledge for computing fixes for new releases (using TimeLIME). Apart from these specific results about planning defect reductions and TimeLIME, the more general point of this paper is that our community should be more careful about using off-the-shelf AI tools without first applying SE knowledge. In this case study, it was not difficult to augment a standard AI algorithm with SE knowledge (that past releases are a good source of knowledge for planning defect reductions). As shown here, once that SE knowledge is applied, this can result in dramatically better systems.
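
The restriction described above (keep only attributes that changed most across prior releases) can be sketched as a filter over a candidate plan. This is an illustration, not the TimeLIME code. The data layout (one dictionary of module attributes per release), the `top_k` cutoff, and the function names are assumptions.

```python
from collections import Counter


def change_counts(releases):
    """Count, per attribute, how often its value changed between consecutive releases."""
    counts = Counter()
    for old, new in zip(releases, releases[1:]):
        for module in old.keys() & new.keys():
            for attr, value in old[module].items():
                if attr in new[module] and new[module][attr] != value:
                    counts[attr] += 1
    return counts


def plausible_attributes(releases, top_k):
    """The top_k most frequently changed attributes (ties broken alphabetically)."""
    counts = change_counts(releases)
    ranked = sorted(counts, key=lambda a: (-counts[a], a))
    return set(ranked[:top_k])


def restrict_plan(plan, allowed):
    """Drop recommended changes that touch attributes with no precedent."""
    return {attr: change for attr, change in plan.items() if attr in allowed}


releases = [
    {"A.java": {"loc": 100, "cbo": 5, "dit": 2}, "B.java": {"loc": 50, "cbo": 3, "dit": 1}},
    {"A.java": {"loc": 120, "cbo": 5, "dit": 2}, "B.java": {"loc": 60, "cbo": 4, "dit": 1}},
    {"A.java": {"loc": 130, "cbo": 6, "dit": 2}, "B.java": {"loc": 60, "cbo": 4, "dit": 1}},
]
allowed = plausible_attributes(releases, top_k=2)
plan = restrict_plan({"loc": "decrease", "dit": "decrease"}, allowed)
```

Here `dit` never changed in earlier releases, so the recommendation to change it is dropped as implausible, while the `loc` recommendation is kept.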

SE · Mar 15, 2020
How to Improve AI Tools (by Adding in SE Knowledge): Experiments with the TimeLIME Defect Reduction Tool

Kewen Peng, Tim Menzies

AI algorithms are being used with increased frequency in SE research and practice. Such algorithms are usually commissioned and certified using data from outside the SE domain. Can we assume that such algorithms can be used "off-the-shelf" (i.e., with no modifications)? To say that another way, are there special features of SE problems that suggest a different and better way to use AI tools? To answer these questions, this paper reports experiments with TimeLIME, a variant of the LIME explanation algorithm from KDD'16. LIME can offer recommendations on how to change static code attributes in order to reduce the number of defects in the next software release. That version of LIME used an internal weighting tool to decide what attributes to include/exclude in those recommendations. TimeLIME improves on that weighting scheme using the following SE knowledge: software comes in releases; an implausible change to software is something that has never been changed in prior releases; so it is better to use plausible changes, i.e., changes with some precedent in the prior releases. By restricting recommendations to just the frequently changed attributes, TimeLIME can produce (a) dramatically better explanations of what causes defects and (b) much better recommendations on how to fix buggy code. Apart from these specific results about defect reduction and TimeLIME, the more general point of this paper is that our community should be more careful about using off-the-shelf AI tools without first applying SE knowledge. As shown here, it may not be a complex matter to apply that knowledge. Further, once that SE knowledge is applied, this can result in dramatically better systems.