Zhe Yu

SE
h-index10
48papers
842citations
Novelty47%
AI Score56

48 Papers

AIJun 4
FIDES: Faithful Inference via Deep Evidence Signals for Retrieval-Memory Conflict in RAG

Zhe Yu, Wenpeng Xing, Tiancheng Zhao et al.

When retrieved evidence contradicts parametric memory, language models frequently ignore context and default to memorized priors -- a failure that undermines the core purpose of retrieval augmentation. Contrastive decoding amplifies the context-conditioned output to suppress parametric bias, but existing methods rest on an implicit assumption that this bias is uniform across tokens. A single global contrastive weight over-penalizes safe tokens while leaving genuinely conflicted ones insufficiently corrected. We identify token-level conflict concentration: retrieval-memory tension is sharply heterogeneous, concentrated on a small fraction of answer-critical decoding steps. This reframes contrastive decoding from how much contrast to apply to where to apply it. We propose FIDES (Faithful Inference via Deep Evidence Signals), a training-free decoder that reads three internal signals probing retrieval-memory conflict at complementary depths -- output surface, hidden representations, and prediction trajectory -- and fuses them to govern intervention strength at each decoding step. Across three benchmarks and six backbones -- four primary 7B/8B models and two scaling backbones up to 70B -- FIDES achieves the best context fidelity in all 18 settings, outperforming the strongest training-free baseline by +3 to +13 points. On the 70B scale, fidelity reaches 92-94% while F1 surges to 62-63%, demonstrating that token-level selectivity unlocks generation capability that coarse contrastive rules suppress.

SYMay 10, 2017
Online Calibration of Phasor Measurement Unit Using Density-Based Spatial Clustering

Xinan Wang, Di Shi, Zhiwei Wang et al.

Data quality of Phasor Measurement Unit (PMU) is receiving increasing attention as it has been identified as one of the limiting factors that affect many wide-area measurement system (WAMS) based applications. In general, existing PMU calibration methods include offline testing and model based approaches. However, in practice, the effectiveness of both is limited due to the very strong assumptions employed. This paper presents a novel framework for online bias error detection and calibration of PMU measurement using density-based spatial clustering of applications with noise (DBSCAN) based on much relaxed assumptions. With a new problem formulation, the proposed data mining based methodology is applicable across a wide spectrum of practical conditions and one side-product of it is more accurate transmission line parameters for EMS database and protective relay settings. Case studies demonstrate the effectiveness of the proposed approach.

LGJun 5, 2023
A Data-driven Region Generation Framework for Spatiotemporal Transportation Service Management

Liyue Chen, Jiangyi Fang, Zhe Yu et al.

MAUP (modifiable areal unit problem) is a fundamental problem for spatial data management and analysis. As an instantiation of MAUP in online transportation platforms, region generation (i.e., specifying the areal unit for service operations) is the first and vital step for supporting spatiotemporal transportation services such as ride-sharing and freight transport. Most existing region generation methods are manually specified (e.g., fixed-size grids), suffering from poor spatial semantic meaning and inflexibility to meet service operation requirements. In this paper, we propose RegionGen, a data-driven region generation framework that can specify regions with key characteristics (e.g., good spatial semantic meaning and predictability) by modeling region generation as a multi-objective optimization problem. First, to obtain good spatial semantic meaning, RegionGen segments the whole city into atomic spatial elements based on road networks and obstacles (e.g., rivers). Then, it clusters the atomic spatial elements into regions by maximizing various operation characteristics, which is formulated as a multi-objective optimization problem. For this optimization problem, we propose a multi-objective co-optimization algorithm. Extensive experiments verify that RegionGen can generate more suitable regions than traditional methods for spatiotemporal service management.

SYMar 22, 2019
Bayesian Estimation Based Parameter Estimation for Composite Load

Chang Fu, Zhe Yu, Di Shi et al.

Accurate identification of parameters of load models is essential in power system computations, including simulation, prediction, and stability and reliability analysis. Conventional point estimation based composite load modeling approaches suffer from disturbances and noises and provide limited information of the system dynamics. In this work, a statistic (Bayesian Estimation) based distribution estimation approach is proposed for both static (ZIP) and dynamic (Induction Motor) load modeling. When dealing with multiple parameters, Gibbs sampling method is employed. In each iteration, the proposal samples each parameter while keeps others fixed. The proposed method provides a distribution estimation of load models coefficients and is robust to measurement errors.

AIMay 26
Detecting Is Not Resolving: The Monitoring Control Gap in Retrieval Augmented LLMs

Zhe Yu, Wenpeng Xing, Chen Ye et al.

Retrieval-augmented LLMs are deployed for tasks where evidence quality determines action safety, yet evaluation protocols assume that single-turn robustness predicts robustness when evidence accumulates across turns. We show this assumption is fundamentally incorrect. Models exhibit a monitoring-control gap: they readily acknowledge contradictory evidence, yet this awareness fails to constrain their final recommendations - detecting epistemic conflict does not imply resolving it safely. Through a multi-turn document accumulation protocol across four model families (1.5B-32B parameters) and over 50,000 turn-level evaluations, we demonstrate that single-turn diagnostics systematically overestimate RAG safety, that contradiction acknowledgement is uncorrelated with safe resolution, a pattern corroborated by targeted human validation, and that no universal prompt fix exists. Converging mechanism evidence - hidden-state probing, attention analysis, and response-strategy taxonomy - points to action selection as the most plausible locus of the deficit: danger-relevant information is internally represented and receives enhanced attention during unsafe generation, yet fails to constrain output behavior. The gap between what models recognize and what they do must be measured and closed before retrieval-augmented systems can be trusted in high-stakes settings.

CRMay 26
Cordon-MAS: Defending RAG against Knowledge Poisoning via Information-Flow Control

Zhe Yu, Wenpeng Xing, Gaolei Li et al.

Retrieval-augmented generation (RAG) increasingly underpins high-stakes applications, yet remains vulnerable to Confundo-style poisoning where adversarially optimized documents manipulate generated outputs. Existing defenses assume that detecting poisoned evidence prevents harm. We show this assumption is incorrect: models exhibit a monitoring-control gap -- they can detect contradictions in retrieved evidence yet still act on poisoned claims. We introduce the Cordon Principle -- no agent capable of final synthesis may access untrusted natural-language evidence -- and realize it through CORDON-MAS, a compartmentalized framework that enforces this principle architecturally by separating evidence extraction, cross-source audit, and answer synthesis into agents with asymmetric memory privileges. Across five BEIR datasets, CORDON-MAS reduces attack success rate by 92.4\% relative to undefended RAG. This reframes RAG poisoning from a detection problem to an information-flow control problem.

AIMay 26
The Attribution Blind Spot: Detecting When Language Models Rely on Memory Rather Than Retrieved Context

Zhe Yu, Wenpeng Xing, Yunzhao Wei et al.

Retrieval-augmented generation promises to ground language model outputs in external evidence, yet the field has no reliable way to verify whether retrieved context actually governs generation -- a prerequisite for any high-stakes deployment. The standard assumption, that context-consistent output implies context-governed output, breaks when the retrieved document overlaps with the model's pretraining data: the model can produce faithful-looking text entirely from parametric memory, and both pathways yield indistinguishable output. We name this failure the attribution blind spot and introduce Computational Reality Monitoring (CRM) to address it. CRM operationalizes a principle adapted from cognitive science's reality monitoring framework: comparing internal representations with and without context reveals membership-conditioned representational divergence that output-level monitors systematically miss. CRM does not certify which source an individual generation used; it detects whether pretraining exposure leaves a measurable internal trajectory signature, establishing a necessary substrate for source attribution. Across nine model variants spanning three families, this divergence concentrates in architecture-specific layer patterns, receives converging support from block-level noise intervention, and generalizes across tasks and datasets while collapsing on domain-confounded benchmarks. The attribution blind spot is measurable and partially addressable: internal representations carry a diagnostic signal invisible at the output level, establishing a foundation for systems whose internal awareness of evidence provenance governs their external behavior.

AIMay 26
Composition Collapse: Stable Factual Knowledge Does Not Imply Compositional Reasoning

Zhe Yu, Wenpeng Xing, Yunzhao Wei et al.

Post-training is routinely evaluated through aggregate benchmark scores that treat multi-hop reasoning as a single capability -- as if a model that answers more questions correctly must be better at assembling facts. We show that this assumption can be misleading: recipes with statistically indistinguishable atomic knowledge produce composition behaviour separated by over 40 percentage points, a phenomenon we call composition collapse: the systematic failure to assemble stably-known facts into chains, invisible to aggregate metrics. We introduce a double-gate protocol that changes the estimand from an aggregate compositionality gap to residual composition failure conditioned on stable atomic access, decomposing post-training gains into three independent channels: atomic stability, residual composition, and critical depth. On a benchmark of temporal factual chains spanning depths 2--11 across four post-training recipes, this decomposition reveals that post-training objectives shift composition capability in directions that aggregate metrics mask, and suggests that claims about multi-hop reasoning improvement should be accompanied by atomic-gate-controlled composition metrics. Diagnostic probes further show that a substantial share of measured composition failure reflects generation-time computation constraints rather than permanent inability to compose.

SPNov 8, 2017
An Extended Kalman Filter Enhanced Hilbert-Huang Transform in Oscillation Detection

Zhe Yu, Di Shi, Haifeng Li et al.

Hilbert-Huang transform (HHT) has drawn great attention in power system analysis due to its capability to deal with dynamic signal and provide instantaneous characteristics such as frequency, damping, and amplitudes. However, its shortcomings, including mode mixing and end effects, are as significant as its advantages. A preliminary result of an extended Kalman filter (EKF) method to enhance HHT and hopefully to overcome these disadvantages is presented in this paper. The proposal first removes dynamic DC components in signals using empirical mode decomposition. Then an EKF model is applied to extract instant coefficients. Numerical results using simulated and real-world low-frequency oscillation data suggest the proposal can help to overcome the mode mixing and end effects with a properly chosen number of modes.

NANov 8, 2017
Recover the lost Phasor Measurement Unit Data Using Alternating Direction Multipliers Method

Mang Liao, Di Shi, Zhe Yu et al.

This paper presents a novel algorithm for recovering missing data of phasor measurement units (PMUs). Due to the low-rank property of PMU data, missing measurement recovery can be formulated as a low-rank matrix-completion problem. Based on maximum-margin matrix factorization, we propose an efficient algorithm based on alternating direction method of multipliers (ADMM) for solving the matrix completion problem. Comparing to existing approaches, the proposed ADMM based algorithm does not need to estimate the rank of the target data matrix and provides better performance in computation complexity. In addition, we consider the case of measurements missing from all PMU channels and provide a strategy of reshaping the matrix which contains the received PMU data for recovery. Numerical results using PMU measurements from IEEE 68-bus power system model illustrate the effectiveness and efficiency of the proposed approaches.

AIApr 7
From Retinal Evidence to Safe Decisions: RETINA-SAFE and ECRT for Hallucination Risk Triage in Medical LLMs

Zhe Yu, Wenpeng Xing, Meng Han

Hallucinations in medical large language models (LLMs) remain a safety-critical issue, particularly when available evidence is insufficient or conflicting. We study this problem in diabetic retinopathy (DR) decision settings and introduce RETINA-SAFE, an evidence-grounded benchmark aligned with retinal grading records, comprising 12,522 samples. RETINA-SAFE is organized into three evidence-relation tasks: E-Align (evidence-consistent), E-Conflict (evidence-conflicting), and E-Gap (evidence-insufficient). We further propose ECRT (Evidence-Conditioned Risk Triage), a two-stage white-box detection framework: Stage 1 performs Safe/Unsafe risk triage, and Stage 2 refines unsafe cases into contradiction-driven versus evidence-gap risks. ECRT leverages internal representation and logit shifts under CTX/NOCTX conditions, with class-balanced training for robust learning. Under evidence-grouped (not patient-disjoint) splits across multiple backbones, ECRT provides strong Stage-1 risk triage and explicit subtype attribution, improves Stage-1 balanced accuracy by +0.15 to +0.19 over external uncertainty and self-consistency baselines and by +0.02 to +0.07 over the strongest adapted supervised baseline, and consistently exceeds a single-stage white-box ablation on Stage-1 balanced accuracy. These findings support white-box internal signals grounded in retinal evidence as a practical route to interpretable medical LLM risk triage.

AIApr 7
LatentAudit: Real-Time White-Box Faithfulness Monitoring for Retrieval-Augmented Generation with Verifiable Deployment

Zhe Yu, Wenpeng Xing, Meng Han

Retrieval-augmented generation (RAG) mitigates hallucination but does not eliminate it: a deployed system must still decide, at inference time, whether its answer is actually supported by the retrieved evidence. We introduce LatentAudit, a white-box auditor that pools mid-to-late residual-stream activations from an open-weight generator and measures their Mahalanobis distance to the evidence representation. The resulting quadratic rule requires no auxiliary judge model, runs at generation time, and is simple enough to calibrate on a small held-out set. We show that residual-stream geometry carries a usable faithfulness signal, that this signal survives architecture changes and realistic retrieval failures, and that the same rule remains amenable to public verification. On PubMedQA with Llama-3-8B, LatentAudit reaches 0.942 AUROC with 0.77,ms overhead. Across three QA benchmarks and five model families (Llama-2/3, Qwen-2.5/3, Mistral), the monitor remains stable; under a four-way stress test with contradictions, retrieval misses, and partial-support noise, it reaches 0.9566--0.9815 AUROC on PubMedQA and 0.9142--0.9315 on HotpotQA. At 16-bit fixed-point precision, the audit rule preserves 99.8% of the FP16 AUROC, enabling Groth16-based public verification without revealing model weights or activations. Together, these results position residual-stream geometry as a practical basis for real-time RAG faithfulness monitoring and optional verifiable deployment.

AISep 7, 2022
An Argumentation-Based Legal Reasoning Approach for DL-Ontology

Zhe Yu, Yiwei Lu

Ontology is a popular method for knowledge representation in different domains, including the legal domain, and description logics (DL) is commonly used as its description language. To handle reasoning based on inconsistent DL-based legal ontologies, the current paper presents a structured argumentation framework particularly for reasoning in legal contexts on the basis of ASPIC+, and translates the legal ontology into formulas and rules of an argumentation theory. With a particular focus on the design of autonomous vehicles from the perspective of legal AI, we show that using this combined theory of formal argumentation and DL-based legal ontology, acceptable assertions can be obtained based on inconsistent ontologies, and the traditional reasoning tasks of DL ontologies can also be accomplished. In addition, a formal definition of explanations for the result of reasoning is presented.

LGJun 27, 2025Code
UniCA: Adapting Time Series Foundation Model to General Covariate-Aware Forecasting

Lu Han, Yu Liu, Qiwen Deng et al.

Time Series Foundation Models (TSFMs) have achieved remarkable success through large-scale pretraining. However, their design primarily targets real-valued series, limiting their ability to handle general forecasting tasks involving diverse and often heterogeneous covariates--such as categorical variables and multimodal data (e.g., images, text)--which are typically task-specific and difficult to leverage during pretraining. To address this gap, we propose Unified Covariate Adaptation (UniCA), a framework to bridge TSFMs with general covariate-aware forecasting. UniCA first performs covariate homogenization to transform heterogeneous covariates into high-level homogeneous series representations and then fuses them via a unified attention-based fusion mechanism. UniCA is compatible and universal for adaptation with both homogeneous and heterogeneous covariates, incorporating extra covariate information while preserving the generalization ability of TSFMs.Extensive experiments on multiple unimodal and multimodal covariate-aware forecasting benchmarks demonstrate the superiority of UniCA, highlighting the promise of covariate-aware TSFM adaptation in real-world forecasting scenarios. Codes are released on https://github.com/hanlu-nju/UniCA.

LGAug 4, 2024
A Multi-class Ride-hailing Service Subsidy System Utilizing Deep Causal Networks

Zhe Yu, Chi Xia, Shaosheng Cao et al.

In the ride-hailing industry, subsidies are predominantly employed to incentivize consumers to place more orders, thereby fostering market growth. Causal inference techniques are employed to estimate the consumer elasticity with different subsidy levels. However, the presence of confounding effects poses challenges in achieving an unbiased estimate of the uplift effect. We introduce a consumer subsidizing system to capture relationships between subsidy propensity and the treatment effect, which proves effective while maintaining a lightweight online environment.

LGJul 17, 2021Code
FairBalance: How to Achieve Equalized Odds With Data Pre-processing

Zhe Yu, Joymallya Chakraborty, Tim Menzies

This research seeks to benefit the software engineering society by providing a simple yet effective pre-processing approach to achieve equalized odds fairness in machine learning software. Fairness issues have attracted increasing attention since machine learning software is increasingly used for high-stakes and high-risk decisions. Amongst all the existing fairness notions, this work specifically targets "equalized odds" given its advantage in always allowing perfect classifiers. Equalized odds requires that members of every demographic group do not receive disparate mistreatment. Prior works either optimize for an equalized odds related metric during the learning process like a black-box, or manipulate the training data following some intuition. This work studies the root cause of the violation of equalized odds and how to tackle it. We found that equalizing the class distribution in each demographic group with sample weights is a necessary condition for achieving equalized odds without modifying the normal training process. In addition, an important partial condition for equalized odds (zero average odds difference) can be guaranteed when the class distributions are weighted to be not only equal but also balanced (1:1). Based on these analyses, we proposed FairBalance, a pre-processing algorithm which balances the class distribution in each demographic group by assigning calculated weights to the training data. On eight real-world datasets, our empirical results show that, at low computational overhead, the proposed pre-processing algorithm FairBalance can significantly improve equalized odds without much, if any damage to the utility. FairBalance also outperforms existing state-of-the-art approaches in terms of equalized odds. To facilitate reuse, reproduction, and validation, we made our scripts available at https://github.com/hil-se/FairBalance.

SENov 4, 2019Code
Understanding Static Code Warnings: an Incremental AI Approach

Xueqi Yang, Zhe Yu, Junjie Wang et al.

Knowledge-based systems reason over some knowledge base. Hence, an important issue for such systems is how to acquire the knowledge needed for their inference. This paper assesses active learning methods for acquiring knowledge for "static code warnings". Static code analysis is a widely-used method for detecting bugs and security vulnerabilities in software systems. As software becomes more complex, analysis tools also report lists of increasingly complex warnings that developers need to address on a daily basis. Such static code analysis tools are usually over-cautious; i.e. they often offer many warnings about spurious issues. Previous research work shows that about 35% to 91% of warnings reported as bugs by SA tools are actually unactionable (i.e., warnings that would not be acted on by developers because they are falsely suggested as bugs). Experienced developers know which errors are important and which can be safely ignored. How can we capture that experience? This paper reports on an incremental AI tool that watches humans reading false alarm reports. Using an incremental support vector machine mechanism, this AI tool can quickly learn to distinguish spurious false alarms from more serious matters that deserve further attention. In this work, nine open-source projects are employed to evaluate our proposed model on the features extracted by previous researchers and identify the actionable warnings in a priority order given by our algorithm. We observe that our model can identify over 90% of actionable warnings when our methods tell humans to ignore 70 to 80% of the warnings.

SEMay 20, 2019Code
Better Technical Debt Detection via SURVEYing

Fahmid M. Fahid, Zhe Yu, Tim Menzies

Software analytics can be improved by surveying; i.e. rechecking and (possibly) revising the labels offered by prior analysis. Surveying is a time-consuming task and effective surveyors must carefully manage their time. Specifically, they must balance the cost of further surveying against the additional benefits of that extra effort. This paper proposes SURVEY0, an incremental Logistic Regression estimation method that implements cost/benefit analysis. Some classifier is used to rank the as-yet-unvisited examples according to how interesting they might be. Humans then review the most interesting examples, after which their feedback is used to update an estimator for estimating how many examples are remaining. This paper evaluates SURVEY0 in the context of self-admitted technical debt. As software project mature, they can accumulate "technical debt" i.e. developer decisions which are sub-optimal and decrease the overall quality of the code. Such decisions are often commented on by programmers in the code; i.e. it is self-admitted technical debt (SATD). Recent results show that text classifiers can automatically detect such debt. We find that we can significantly outperform prior results by SURVEYing the data. Specifically, for ten open-source JAVA projects, we can find 83% of the technical debt via SURVEY0 using just 16% of the comments (and if higher levels of recall are required, SURVEY0can adjust towards that with some additional effort).

SEMay 5, 2019Code
Better Data Labelling with EMBLEM (and how that Impacts Defect Prediction)

Huy Tu, Zhe Yu, Tim Menzies

Standard automatic methods for recognizing problematic development commits can be greatly improved via the incremental application of human+artificial expertise. In this approach, called EMBLEM, an AI tool first explore the software development process to label commits that are most problematic. Humans then apply their expertise to check those labels (perhaps resulting in the AI updating the support vectors within their SVM learner). We recommend this human+AI partnership, for several reasons. When a new domain is encountered, EMBLEM can learn better ways to label which comments refer to real problems. Also, in studies with 9 open source software projects, labelling via EMBLEM's incremental application of human+AI is at least an order of magnitude cheaper than existing methods ($\approx$ eight times). Further, EMBLEM is very effective. For the data sets explored here, EMBLEM better labelling methods significantly improved $P_{opt}20$ and G-scores performance in nearly all the projects studied here.

CVJan 30
Modeling Art Evaluations from Comparative Judgments: A Deep Learning Approach to Predicting Aesthetic Preferences

Manoj Reddy Bethi, Sai Rupa Jhade, Pravallika Yaganti et al.

Modeling human aesthetic judgments in visual art presents significant challenges due to individual preference variability and the high cost of obtaining labeled data. To reduce cost of acquiring such labels, we propose to apply a comparative learning framework based on pairwise preference assessments rather than direct ratings. This approach leverages the Law of Comparative Judgment, which posits that relative choices exhibit less cognitive burden and greater cognitive consistency than direct scoring. We extract deep convolutional features from painting images using ResNet-50 and develop both a deep neural network regression model and a dual-branch pairwise comparison model. We explored four research questions: (RQ1) How does the proposed deep neural network regression model with CNN features compare to the baseline linear regression model using hand-crafted features? (RQ2) How does pairwise comparative learning compare to regression-based prediction when lacking access to direct rating values? (RQ3) Can we predict individual rater preferences through within-rater and cross-rater analysis? (RQ4) What is the annotation cost trade-off between direct ratings and comparative judgments in terms of human time and effort? Our results show that the deep regression model substantially outperforms the baseline, achieving up to $328\%$ improvement in $R^2$. The comparative model approaches regression performance despite having no access to direct rating values, validating the practical utility of pairwise comparisons. However, predicting individual preferences remains challenging, with both within-rater and cross-rater performance significantly lower than average rating prediction. Human subject experiments reveal that comparative judgments require $60\%$ less annotation time per item, demonstrating superior annotation efficiency for large-scale preference modeling.

CVJan 30
Modeling Image-Caption Rating from Comparative Judgments

Kezia Minni, Qiang Zhang, Monoshiz Mahbub Khan et al.

Rating the accuracy of captions in describing images is time-consuming and subjective for humans. In contrast, it is often easier for people to compare two captions and decide which one better matches a given image. In this work, we propose a machine learning framework that models such comparative judgments instead of direct ratings. The model can then be applied to rank unseen image-caption pairs in the same way as a regression model trained on direct ratings. Using the VICR dataset, we extract visual features with ResNet-50 and text features with MiniLM, then train both a regression model and a comparative learning model. While the regression model achieves better performance (Pearson's $ρ$: 0.7609 and Spearman's $r_s$: 0.7089), the comparative learning model steadily improves with more data and approaches the regression baseline. In addition, a small-scale human evaluation study comparing absolute rating, pairwise comparison, and same-image comparison shows that comparative annotation yields faster results and has greater agreement among human annotators. These results suggest that comparative learning can effectively model human preferences while significantly reducing the cost of human annotations.

AISep 18, 2024
Explaining Non-monotonic Normative Reasoning using Argumentation Theory with Deontic Logic

Zhe Yu, Yiwei Lu

In our previous research, we provided a reasoning system (called LeSAC) based on argumentation theory to provide legal support to designers during the design process. Building on this, this paper explores how to provide designers with effective explanations for their legally relevant design decisions. We extend the previous system for providing explanations by specifying norms and the key legal or ethical principles for justifying actions in normative contexts. Considering that first-order logic has strong expressive power, in the current paper we adopt a first-order deontic logic system with deontic operators and preferences. We illustrate the advantages and necessity of introducing deontic logic and designing explanations under LeSAC by modelling two cases in the context of autonomous driving. In particular, this paper also discusses the requirements of the updated LeSAC to guarantee rationality, and proves that a well-defined LeSAC can satisfy the rationality postulate for rule-based argumentation frameworks. This ensures the system's ability to provide coherent, legally valid explanations for complex design decisions.

MMSep 22, 2025
Mano Technical Report

Tianyu Fu, Anyang Su, Chenxu Zhao et al.

Graphical user interfaces (GUIs) are the primary medium for human-computer interaction, yet automating GUI interactions remains challenging due to the complexity of visual elements, dynamic environments, and the need for multi-step reasoning. Existing methods based on vision-language models (VLMs) often suffer from limited resolution, domain mismatch, and insufficient sequential decisionmaking capability. To address these issues, we propose Mano, a robust GUI agent built upon a multi-modal foundation model pre-trained on extensive web and computer system data. Our approach integrates a novel simulated environment for high-fidelity data generation, a three-stage training pipeline (supervised fine-tuning, offline reinforcement learning, and online reinforcement learning), and a verification module for error recovery. Mano demonstrates state-of-the-art performance on multiple GUI benchmarks, including Mind2Web and OSWorld, achieving significant improvements in success rate and operational accuracy. Our work provides new insights into the effective integration of reinforcement learning with VLMs for practical GUI agent deployment, highlighting the importance of domain-specific data, iterative training, and holistic reward design.

CRDec 14, 2025
One Leak Away: How Pretrained Model Exposure Amplifies Jailbreak Risks in Finetuned LLMs

Yixin Tan, Zhe Yu, Jun Sakuma

Finetuning pretrained large language models (LLMs) has become the standard paradigm for developing downstream applications. However, its security implications remain unclear, particularly regarding whether finetuned LLMs inherit jailbreak vulnerabilities from their pretrained sources. We investigate this question in a realistic pretrain-to-finetune threat model, where the attacker has white-box access to the pretrained LLM and only black-box access to its finetuned derivatives. Empirical analysis shows that adversarial prompts optimized on the pretrained model transfer most effectively to its finetuned variants, revealing inherited vulnerabilities from pretrained to finetuned LLMs. To further examine this inheritance, we conduct representation-level probing, which shows that transferable prompts are linearly separable within the pretrained hidden states, suggesting that universal transferability is encoded in pretrained representations. Building on this insight, we propose the Probe-Guided Projection (PGP) attack, which steers optimization toward transferability-relevant directions. Experiments across multiple LLM families and diverse finetuned tasks confirm PGP's strong transfer success, underscoring the security risks inherent in the pretrain-to-finetune paradigm.

SEMar 6
Story Point Estimation Using Large Language Models

Pranam Prakash Shetty, Adarsh Balakrishnan, Mengqiao Xu et al.

This study investigates the use of large language models (LLMs) for story point estimation. Story points are unitless, project-specific effort estimates that help developers on the scrum team forecast which product backlog items they plan to complete in a sprint. To facilitate this process, machine learning models, especially deep neural networks, have been applied to predict the story points based on the title and description of each item. However, such machine learning models require sufficient amounts of training data (with ground truth story points annotated by human developers) from the same software project to achieve decent prediction performance. This motivated us to explore whether LLMs are capable of (RQ1) predicting story points without training data or (RQ2) with only a few training data points. Our empirical results with four LLMs on 16 software projects show that, without any training data (zero-shot prompting), LLMs can predict story points better than supervised deep learning models trained on 80% of the data. The prediction performance of LLMs can be further improved with a few training examples (few-shot prompting). In addition, a recent study explored the use of comparative judgments (between a given pair of items which one requires more effort to implement) instead of directly annotating the story points to reduce the cognitive burden on developers. Therefore, this study also explores (RQ3) whether comparative judgments are easier to predict than story points for LLMs and (RQ4) whether comparative judgments can serve as few-shot examples for LLMs to improve their predictions of story points. Empirical results suggest that it is not easier for LLMs to predict comparative judgments than to directly estimate the story points, but comparative judgments can serve as few-shot examples to improve the LLMs' prediction performance as well as the human-annotated story points.

LGNov 14, 2025
FairReweighing: Density Estimation-Based Reweighing Framework for Improving Separation in Fair Regression

Xiaoyin Xi, Zhe Yu

There has been a prevalence of applying AI software in both high-stakes public-sector and industrial contexts. However, the lack of transparency has raised concerns about whether these data-informed AI software decisions secure fairness against people of all racial, gender, or age groups. Despite extensive research on emerging fairness-aware AI software, up to now most efforts to solve this issue have been dedicated to binary classification tasks. Fairness in regression is relatively underexplored. In this work, we adopted a mutual information-based metric to assess separation violations. The metric is also extended so that it can be directly applied to both classification and regression problems with both binary and continuous sensitive attributes. Inspired by the Reweighing algorithm in fair classification, we proposed a FairReweighing pre-processing algorithm based on density estimation to ensure that the learned model satisfies the separation criterion. Theoretically, we show that the proposed FairReweighing algorithm can guarantee separation in the training data under a data independence assumption. Empirically, on both synthetic and real-world data, we show that FairReweighing outperforms existing state-of-the-art regression fairness solutions in terms of improving separation while maintaining high accuracy.

AIJul 30, 2025
Cross-Border Legal Adaptation of Autonomous Vehicle Design based on Logic and Non-monotonic Reasoning

Zhe Yu, Yiwei Lu, Burkhard Schafer et al.

This paper focuses on the legal compliance challenges of autonomous vehicles in a transnational context. We choose the perspective of designers and try to provide supporting legal reasoning in the design process. Based on argumentation theory, we introduce a logic to represent the basic properties of argument-based practical (normative) reasoning, combined with partial order sets of natural numbers to express priority. Finally, through case analysis of legal texts, we show how the reasoning system we provide can help designers to adapt their design solutions more flexibly in the cross-border application of autonomous vehicles and to more easily understand the legal implications of their decisions.

AIJul 19, 2025
Efficient Story Point Estimation With Comparative Learning

Monoshiz Mahbub Khan, Xiaoyin Xi, Andrew Meneely et al.

Story point estimation is an essential part of agile software development. Story points are unitless, project-specific effort estimates that help developers plan their sprints. Traditionally, developers estimate story points collaboratively using planning poker or other manual techniques. While the initial calibrating of the estimates to each project is helpful, once a team has converged on a set of precedents, story point estimation can become tedious and labor-intensive. Machine learning can reduce this burden, but only with enough context from the historical decisions made by the project team. That is, state-of-the-art models, such as GPT2SP and FastText-SVM, only make accurate predictions (within-project) when trained on data from the same project. The goal of this work is to streamline story point estimation by evaluating a comparative learning-based framework for calibrating project-specific story point prediction models. Instead of assigning a specific story point value to every backlog item, developers are presented with pairs of items, and indicate which item requires more effort. Using these comparative judgments, a machine learning model is trained to predict the story point estimates. We empirically evaluated our technique using data with 23,313 manual estimates in 16 projects. The model learned from comparative judgments can achieve on average 0.34 Spearman's rank correlation coefficient between its predictions and the ground truth story points. This is similar to, if not better than, the performance of a regression model learned from the ground truth story points. Therefore, the proposed comparative learning approach is more efficient than state-of-the-art regression-based approaches according to the law of comparative judgments - providing comparative judgments yields a lower cognitive burden on humans than providing ratings or categorical labels.

LGDec 21, 2021
Differential Parity: Relative Fairness Between Two Sets of Decisions

Zhe Yu, Xiaoyin Xi

With AI systems widely applied to assist human in decision-making processes such as talent hiring, school admission, and loan approval; there is an increasing need to ensure that the decisions made are fair. One major challenge for analyzing fairness in decisions is that the standards are highly subjective and contextual -- there is no consensus for what absolute fairness means for every scenario. Not to say that different fairness standards often conflict with each other. To bypass this issue, this work aims to test relative fairness in decisions. That is, instead of defining what are ``absolutely'' fair decisions, we propose to test the relative fairness of one decision set against another with differential parity -- the difference between two sets of decisions should be independent from a certain sensitive attribute. This proposed differential parity fairness notion has the following benefits: (1) it avoids the ambiguous and contradictory definition of ``absolutely'' fair decisions; (2) it reveals the relative preference and bias between two decision sets; (3) differential parity can serve as a new group fairness notion when a reference set of decisions (ground truths) is provided. One limitation for differential parity is that, it requires the two sets of decisions under comparison to be made on the same data subjects. To overcome this limitation, we propose to utilize a machine learning model to bridge the gap between the two decisions sets made on difference data and estimate the differential parity.

SEDec 2, 2021
On the Documentation of Refactoring Types

Eman Abdullah AlOmar, Jiaqian Liu, Kenneth Addo et al.

Commit messages are the atomic level of software documentation. They provide a natural language description of the code change and its purpose. Messages are critical for software maintenance and program comprehension. Unlike documenting feature updates and bug fixes, little is known about how developers document their refactoring activities. Developers can perform multiple refactoring operations, including moving methods, extracting classes, for various reasons. Yet, there is no systematic study that analyzes the extent to which the documentation of refactoring accurately describes the refactoring operations performed at the source code level. Therefore, this paper challenges the ability of refactoring documentation to adequately predict the refactoring types, performed at the commit level. Our analysis relies on the text mining of commit messages to extract the corresponding features that better represent each class. The extraction of text patterns, specific to each refactoring allows the design of a model that verifies the consistency of these patterns with their corresponding refactoring. Such verification process can be achieved via automatically predicting the method-level type of refactoring being applied, namely Extract Method, Inline Method, Move Method, Pull-up Method, Push-down Method, and Rename Method. We compared various classifiers, and a baseline keyword-based approach, in terms of their prediction performance, using a dataset of 5,004 commits. Our main findings show that the complexity of refactoring type prediction varies from one type to another. Rename method and Extract method were found to be the best documented refactoring activities, while Pull-up Method and Push-down Method were the hardest to be identified via textual descriptions. Such findings bring the attention of developers to the necessity of paying more attention to the documentation of these types.

CVMay 31, 2020
A General-Purpose Dehazing Algorithm based on Local Contrast Enhancement Approaches

Bangyong Sun, Vincent Whannou de Dravo, Zhe Yu

Dehazing is in the image processing and computer vision communities, the task of enhancing the image taken in foggy conditions. To better understand this type of algorithm, we present in this document a dehazing method which is suitable for several local contrast adjustment algorithms. We base it on two filters. The first filter is built with a step of normalization with some other statistical tricks while the last represents the local contrast improvement algorithm. Thus, it can work on both CPU and GPU for real-time applications. We hope that our approach will open the door to new ideas in the community. Other advantages of our method are first that it does not need to be trained, then it does not need additional optimization processing. Furthermore, it can be used as a pre-treatment or post-processing step in many vision tasks. In addition, it does not need to convert the problem into a physical interpretation, and finally that it is very fast. This family of defogging algorithms is fairly simple, but it shows promising results compared to state-of-the-art algorithms based not only on a visual assessment but also on objective criteria.

SEMay 31, 2020
Learning to Recognize Actionable Static Code Warnings (is Intrinsically Easy)

Xueqi Yang, Jianfeng Chen, Rahul Yedida et al.

Static code warning tools often generate warnings that programmers ignore. Such tools can be made more useful via data mining algorithms that select the "actionable" warnings; i.e. the warnings that are usually not ignored. In this paper, we look for actionable warnings within a sample of 5,675 actionable warnings seen in 31,058 static code warnings from FindBugs. We find that data mining algorithms can find actionable warnings with remarkable ease. Specifically, a range of data mining methods (deep learners, random forests, decision tree learners, and support vector machines) all achieved very good results (recalls and AUC (TRN, TPR) measures usually over 95% and false alarms usually under 5%). Given that all these learners succeeded so easily, it is appropriate to ask if there is something about this task that is inherently easy. We report that while our data sets have up to 58 raw features, those features can be approximated by less than two underlying dimensions. For such intrinsically simple data, many different kinds of learners can generate useful models with similar performance. Based on the above, we conclude that learning to recognize actionable static code warnings is easy, using a wide range of learning algorithms, since the underlying data is intrinsically simple. If we had to pick one particular learner for this task, we would suggest linear SVMs (since, at least in our sample, that learner ran relatively quickly and achieved the best median performance) and we would not recommend deep learning (since this data is intrinsically very simple).

CVMay 7, 2020
NTIRE 2020 Challenge on NonHomogeneous Dehazing

Codruta O. Ancuti, Cosmin Ancuti, Florin-Alexandru Vasluianu et al.

This paper reviews the NTIRE 2020 Challenge on NonHomogeneous Dehazing of images (restoration of rich details in hazy image). We focus on the proposed solutions and their results evaluated on NH-Haze, a novel dataset consisting of 55 pairs of real haze free and nonhomogeneous hazy images recorded outdoor. NH-Haze is the first realistic nonhomogeneous haze dataset that provides ground truth images. The nonhomogeneous haze has been produced using a professional haze generator that imitates the real conditions of haze scenes. 168 participants registered in the challenge and 27 teams competed in the final testing phase. The proposed solutions gauge the state-of-the-art in image dehazing.

SEMar 23, 2020
Fairway: A Way to Build Fair ML Software

Joymallya Chakraborty, Suvodeep Majumder, Zhe Yu et al.

Machine learning software is increasingly being used to make decisions that affect people's lives. But sometimes, the core part of this software (the learned model), behaves in a biased manner that gives undue advantages to a specific group of people (where those groups are determined by sex, race, etc.). This "algorithmic discrimination" in the AI software systems has become a matter of serious concern in the machine learning and software engineering community. There have been works done to find "algorithmic bias" or "ethical bias" in the software system. Once the bias is detected in the AI software system, the mitigation of bias is extremely important. In this work, we a)explain how ground-truth bias in training data affects machine learning model fairness and how to find that bias in AI software,b)propose a methodFairwaywhich combines pre-processing and in-processing approach to remove ethical bias from training data and trained model. Our results show that we can find bias and mitigate bias in a learned model, without much damaging the predictive performance of that model. We propose that (1) test-ing for bias and (2) bias mitigation should be a routine part of the machine learning software development life cycle. Fairway offers much support for these two purposes.

SEFeb 25, 2020
Identifying Self-Admitted Technical Debts with Jitterbug: A Two-step Approach

Zhe Yu, Fahmid Morshed Fahid, Huy Tu et al.

Keeping track of and managing Self-Admitted Technical Debts (SATDs) are important to maintaining a healthy software project. This requires much time and effort from human experts to identify the SATDs manually. The current automated solutions do not have satisfactory precision and recall in identifying SATDs to fully automate the process. To solve the above problems, we propose a two-step framework called Jitterbug for identifying SATDs. Jitterbug first identifies the "easy to find" SATDs automatically with close to 100% precision using a novel pattern recognition technique. Subsequently, machine learning techniques are applied to assist human experts in manually identifying the remaining "hard to find" SATDs with reduced human effort. Our simulation studies on ten software projects show that Jitterbug can identify SATDs more efficiently (with less human effort) than the prior state-of-the-art methods.

SESep 16, 2019
Assessing Expert System-Assisted Literature Reviews With a Case Study

Zhe Yu, Jeffrey C. Carver, Gregg Rothermel et al.

Given the large number of publications in software engineering, frequent literature reviews are required to keep current on work in specific areas. One tedious work in literature reviews is to find relevant studies amongst thousands of non-relevant search results. In theory, expert systems can assist in finding relevant work but those systems have primarily been tested in simulations rather than in application to actual literature reviews. Hence, few researchers have faith in such expert systems. Accordingly, using a realistic case study, this paper assesses how well our state-of-the-art expert system can help with literature reviews. The assessed literature review aimed at identifying test case prioritization techniques for automated UI testing, specifically from 8,349 papers on IEEE Xplore. This corpus was studied with an expert system that incorporates an incrementally updated human-in-the-loop active learning tool. Using that expert system, in three hours, we found 242 relevant papers from which we identified 12 techniques representing the state-of-the-art in test case prioritization when source code information is not available. These results were then validated by six other graduate students manually exploring the same corpus. Without the expert system, this task would have required 53 hours and would have found 27 additional papers. That is, our expert system achieved 90% recall with 6% of the human effort cost when compared to a conventional manual method. Significantly, the same 12 state-of-the-art test case prioritization techniques were identified by both the expert system and the manual method. That is, the 27 papers missed by the expert system would not have changed the conclusion of the literature review. Hence, if this result generalizes, it endorses the use of our expert system to assist in literature reviews.

SEMay 16, 2019
TERMINATOR: Better Automated UI Test Case Prioritization

Zhe Yu, Fahmid M. Fahid, Tim Menzies et al.

Automated UI testing is an important component of the continuous integration process of software development. A modern web-based UI is an amalgam of reports from dozens of microservices written by multiple teams. Queries on a page that opens up another will fail if any of that page's microservices fails. As a result, the overall cost for automated UI testing is high since the UI elements cannot be tested in isolation. For example, the entire automated UI testing suite at LexisNexis takes around 30 hours (3-5 hours on the cloud) to execute, which slows down the continuous integration process. To mitigate this problem and give developers faster feedback on their code, test case prioritization techniques are used to reorder the automated UI test cases so that more failures can be detected earlier. Given that much of the automated UI testing is "black box" in nature, very little information (only the test case descriptions and testing results) can be utilized to prioritize these automated UI test cases. Hence, this paper evaluates 17 "black box" test case prioritization approaches that do not rely on source code information. Among these, we propose a novel TCP approach, that dynamically re-prioritizes the test cases when new failures are detected, by applying and adapting a state of the art framework from the total recall problem. Experimental results on LexisNexis automated UI testing data show that our new approach (which we call TERMINATOR), outperformed prior state of the art approaches in terms of failure detection rates with negligible CPU overhead.

SEDec 4, 2018
Better Software Analytics via "DUO": Data Mining Algorithms Using/Used-by Optimizers

Amritanshu Agrawal, Tim Menzies, Leandro L. Minku et al.

This paper claims that a new field of empirical software engineering research and practice is emerging: data mining using/used-by optimizers for empirical studies or DUO. For example, data miners can generate models that are explored by optimizers. Also, optimizers can advise how to best adjust the control parameters of a data miner. This combined approach acts like an agent leaning over the shoulder of an analyst that advises "ask this question next" or "ignore that problem, it is not relevant to your goals". Further, those agents can help us build "better" predictive models, where "better" can be either greater predictive accuracy or faster modeling time (which, in turn, enables the exploration of a wider range of options). We also caution that the era of papers that just use data miners is coming to an end. Results obtained from an unoptimized data miner can be quickly refuted, just by applying an optimizer to produce a different (and better performing) model. Our conclusion, hence, is that for software analytics it is possible, useful and necessary to combine data mining and optimization using DUO.

SEAug 31, 2018
Total Recall, Language Processing, and Software Engineering

Zhe Yu, Tim Menzies

A broad class of software engineering problems can be generalized as the "total recall problem". This short paper claims that identifying and exploring total recall language processing problems in software engineering is an important task with wide applicability. To make that case, we show that by applying and adapting the state of the art active learning and text mining, solutions of the total recall problem, can help solve two important software engineering tasks: (a) supporting large literature reviews and (b) identifying software security vulnerabilities. Furthermore, we conjecture that (c) test case prioritization and (d) static warning identification can also be categorized as the total recall problem. The widespread applicability of "total recall" to software engineering suggests that there exists some underlying framework that encompasses not just natural language processing, but a wide range of important software engineering tasks.

SEMay 8, 2018
Crowdtesting : When is The Party Over?

Junjie Wang, Ye Yang, Zhe Yu et al.

Trade-offs such as "how much testing is enough" are critical yet challenging project decisions in software engineering. Most existing approaches adopt risk-driven or value-based analysis to prioritize test cases and minimize test runs. However, none of these is applicable to the emerging crowd testing paradigm where task requesters typically have no control over online crowdworkers's dynamic behavior and uncertain performance. In current practice, deciding when to close a crowdtesting task is largely done by guesswork due to lack of decision support. This paper intends to fill this gap by introducing automated decision support for monitoring and determining appropriate time to close the crowdtesting tasks. First, this paper investigates the necessity and feasibility of close prediction of crowdtesting tasks based on industrial dataset. Then,it designs 8 methods for close prediction, based on various models including the bug trend, bug arrival model, capture-recapture model.Finally, the evaluation is conducted on 218 crowdtesting tasks from one of the largest crowdtesting platforms in China, and the results show that a median of 91% bugs can be detected with 49% saved cost.

SEMar 17, 2018
Improving Vulnerability Inspection Efficiency Using Active Learning

Zhe Yu, Christopher Theisen, Laurie Williams et al.

Software engineers can find vulnerabilities with less effort if they are directed towards code that might contain more vulnerabilities. HARMLESS is an incremental support vector machine tool that builds a vulnerability prediction model from the sourcecode inspected to date, then suggests what source code files should be inspected next. In this way, HARMLESS can reduce the time and effort required to achieve some desired level of recall for finding vulnerabilities. The tool also provides feedback on when to stop (at that desired level of recall) while at the same time, correcting human errors by double-checking suspicious files. This paper evaluates HARMLESS on Mozilla Firefox vulnerability data. HARMLESS found 80, 90, 95, 99% of the vulnerabilities by inspecting 10, 16, 20, 34% of the source code files. When targeting 90, 95, 99% recall, HARMLESS could stop after inspecting 23, 30, 47% of the source code files. Even when human reviewers fail to identify half of the vulnerabilities (50% false negative rate), HARMLESScould detect 96% of the missing vulnerabilities by double-checking half of the inspected files. Our results serve to highlight the very steep cost of protecting software from vulnerabilities (in our case study that cost is, for example, the human effort of inspecting 28,750$\times$20% = 5,750 source code files to identify 95% of the vulnerabilities). While this result could benefit the mission-critical projects where human resources are available for inspecting thousands of source code files, the research challenge for future work is how to further reduce that cost. The conclusion of this paper discusses various ways that goal might be achieved.

SEJan 30, 2018
Data-Driven Search-based Software Engineering

Vivek Nair, Amritanshu Agrawal, Jianfeng Chen et al.

This paper introduces Data-Driven Search-based Software Engineering (DSE), which combines insights from Mining Software Repositories (MSR) and Search-based Software Engineering (SBSE). While MSR formulates software engineering problems as data mining problems, SBSE reformulates SE problems as optimization problems and use meta-heuristic algorithms to solve them. Both MSR and SBSE share the common goal of providing insights to improve software engineering. The algorithms used in these two areas also have intrinsic relationships. We, therefore, argue that combining these two fields is useful for situations (a) which require learning from a large data source or (b) when optimizers need to know the lay of the land to find better solutions, faster. This paper aims to answer the following three questions: (1) What are the various topics addressed by DSE? (2) What types of data are used by the researchers in this area? (3) What research approaches do researchers use? The paper briefly sets out to act as a practical guide to develop new DSE techniques and also to serve as a teaching resource. This paper also presents a resource (tiny.cc/data-se) for exploring DSE. The resource contains 89 artifacts which are related to DSE, divided into 13 groups such as requirements engineering, software product lines, software processes. All the materials in this repository have been used in recent software engineering papers; i.e., for all this material, there exist baseline results against which researchers can comparatively assess their new ideas.

SEJan 7, 2018
Finding Faster Configurations using FLASH

Vivek Nair, Zhe Yu, Tim Menzies et al.

Finding good configurations for a software system is often challenging since the number of configuration options can be large. Software engineers often make poor choices about configuration or, even worse, they usually use a sub-optimal configuration in production, which leads to inadequate performance. To assist engineers in finding the (near) optimal configuration, this paper introduces FLASH, a sequential model-based method, which sequentially explores the configuration space by reflecting on the configurations evaluated so far to determine the next best configuration to explore. FLASH scales up to software systems that defeat the prior state of the art model-based methods in this area. FLASH runs much faster than existing methods and can solve both single-objective and multi-objective optimization problems. The central insight of this paper is to use the prior knowledge (gained from prior runs) to choose the next promising configuration. This strategy reduces the effort (i.e., number of measurements) required to find the (near) optimal configuration. We evaluate FLASH using 30 scenarios based on 7 software systems to demonstrate that FLASH saves effort in 100% and 80% of cases in single-objective and multi-objective problems respectively by up to several orders of magnitude compared to the state of the art techniques.

SYAug 24, 2017
Online Coherence Identification Using Dynamic Time Warping for Controlled Islanding

Hasan Ul Banna, Zhe Yu, Di Shi et al.

Controlled islanding is considered to be the last countermeasure to prevent system-wide blackouts in case of cascading failures. It splits the system into self-sustained islands to maintain transient stability at the expense of possible loss of load. Generator coherence identification is critical to controlled islanding scheme as it helps identify the optimal cut-set to maintain system transient stability. This paper presents a novel approach for online generator coherency identification using phasor measurement unit (PMU) data and dynamic time warping (DTW). Results from the coherence identification are used to further cluster non-generator buses using spectral clustering with the objective of minimizing power flow disruption. The proposed approach is validated and compared to existing methods on the IEEE 39-bus system, through which its advantages are demonstrated.

SEMay 15, 2017
FAST$^2$: an Intelligent Assistant for Finding Relevant Papers

Zhe Yu, Tim Menzies

Literature reviews are essential for any researcher trying to keep up to date with the burgeoning software engineering literature. FAST$^2$ is a novel tool for reducing the effort required for conducting literature reviews by assisting the researchers to find the next promising paper to read (among a set of unread papers). This paper describes FAST$^2$ and tests it on four large software engineering literature reviews conducted by Wahono (2015), Hall (2012), Radjenović (2013) and Kitchenham (2017). We find that FAST$^2$ is a faster and robust tool to assist researcher finding relevant SE papers which can compensate for the errors made by humans during the review process. The effectiveness of FAST$^2$ can be attributed to three key innovations: (1) a novel way of applying external domain knowledge (a simple two or three keyword search) to guide the initial selection of papers---which helps to find relevant research papers faster with less variances; (2) an estimator of the number of remaining relevant papers yet to be found---which in practical settings can be used to decide if the reviewing process needs to be terminated; (3) a novel self-correcting classification algorithm---automatically corrects itself, in cases where the researcher wrongly classifies a paper.

SEMay 14, 2017
FLASH: A Faster Optimizer for SBSE Tasks

Vivek Nair, Zhe Yu, Tim Menzies

Most problems in search-based software engineering involve balancing conflicting objectives. Prior approaches to this task have required a large number of evaluations- making them very slow to execute and very hard to comprehend. To solve these problems, this paper introduces FLASH, a decision tree based optimizer that incrementally grows one decision tree per objective. These trees are then used to select the next best sample. This paper compares FLASH to state-of-the-art algorithms from search-based SE and machine learning. This comparison uses multiple SBSE case studies for release planning, configuration control, process modeling, and sprint planning for agile development. FLASH was found to be the fastest optimizer (sometimes requiring less than 1% of the evaluations used by evolutionary algorithms). Also, measured in terms of model size, FLASH's reasoning was far more succinct and comprehensible. Further, measured in terms of finding effective optimization, FLASH's recommendations were highly competitive with other approaches. Finally, FLASH scaled to more complex models since it always terminated (while state-of-the-art algorithm did not).

SYJun 16, 2017
A Distributed Cooperative Control Framework for Synchronized Reconnection of a Multi-Bus Microgrid

Di Shi, Xi Chen, Zhiwei Wang et al.

One critical value microgrids bring to power systems is resilience, the capability of being able to island from the main grid under certain conditions and connect back when necessary. Once islanded, a microgrid must be synchronized to the main grid before reconnection to prevent severe consequences. In general, synchronization of a single machine with the grid can be easily achieved using a synchronizer. The problem becomes more challenging when it comes to a multi-bus microgrid with multiple distributed generators (DGs) and dispersed loads. All distributed generators need to be properly controlled in a coordinated way to achieve synchronization. This paper presents a novel bi-level distributed cooperative control framework for a multi-bus microgrid. In this framework, DGs work collaboratively in a distributed manner using the minimum and sparse communication. The topology of the communication network can be flexible which supports the plug-and-play feature of microgrids. Fast and deterministic synchronization can be achieved with tolerance to communication latency. Experimental results obtained from Hardware-in-the-Loop (HIL) simulation demonstrate the effectiveness of the proposed approach.

SEDec 10, 2016
Finding Better Active Learners for Faster Literature Reviews

Zhe Yu, Nicholas A. Kraft, Tim Menzies

Literature reviews can be time-consuming and tedious to complete. By cataloging and refactoring three state-of-the-art active learning techniques from evidence-based medicine and legal electronic discovery, this paper finds and implements FASTREAD, a faster technique for studying a large corpus of documents. This paper assesses FASTREAD using datasets generated from existing SE literature reviews (Hall, Wahono, Radjenović, Kitchenham et al.). Compared to manual methods, FASTREAD lets researchers find 95% relevant studies after reviewing an order of magnitude fewer papers. Compared to other state-of-the-art automatic methods, FASTREAD reviews 20-50% fewer studies while finding same number of relevant primary studies in a systematic literature review.