Yepang Liu

SE
h-index23
27papers
1,123citations
Novelty46%
AI Score58

27 Papers

CRJun 8, 2023
Prompt Injection attack against LLM-integrated Applications

Yi Liu, Gelei Deng, Yuekang Li et al.

Large Language Models (LLMs), renowned for their superior proficiency in language comprehension and generation, stimulate a vibrant ecosystem of applications around them. However, their extensive assimilation into various services introduces significant security risks. This study deconstructs the complexities and implications of prompt injection attacks on actual LLM-integrated applications. Initially, we conduct an exploratory analysis on ten commercial applications, highlighting the constraints of current attack strategies in practice. Prompted by these limitations, we subsequently formulate HouYi, a novel black-box prompt injection attack technique, which draws inspiration from traditional web injection attacks. HouYi is compartmentalized into three crucial elements: a seamlessly-incorporated pre-constructed prompt, an injection prompt inducing context partition, and a malicious payload designed to fulfill the attack objectives. Leveraging HouYi, we unveil previously unknown and severe attack outcomes, such as unrestricted arbitrary LLM usage and uncomplicated application prompt theft. We deploy HouYi on 36 actual LLM-integrated applications and discern 31 applications susceptible to prompt injection. 10 vendors have validated our discoveries, including Notion, which has the potential to impact millions of users. Our investigation illuminates both the possible risks of prompt injection attacks and the possible tactics for mitigation.

67.0SEApr 16Code
Towards Understanding Android APIs: Official Lists, Vendor Customizations, and Real-World Usage

Sinan Wang, Qi Zhang, Jiacheng Li et al.

Android apps are built on APIs that abstract core Android system functionalities. These APIs are officially documented in multiple files distributed with the Android source code or SDK, which we collectively refer to as Android API Lists (AALs). Prior Android research has relied on specific AALs, often treating them as interchangeable ground truth. However, recent studies suggest that different AALs can lead to substantially different research outcomes, raising concerns about the validity and reproducibility of Android API-based analyses. To address this issue, we present the first in-depth empirical study of four official AALs that are widely used in prior work. We systematically characterize their contents and analyze their evolution across Android releases. We then perform a fine-grained comparison of the APIs recorded in each AAL to uncover their underlying API inclusion policies and inconsistencies. To assess the practical impact of these differences, we further examine API availability on nine Android devices, including both stock Android and vendor-customized systems. Finally, we analyze API usage in 17,759 real-world Android apps (including open-source apps, commercial apps, and malware) to quantify how the choice of AAL affects empirical Android research. Our results reveal that official AALs are neither stable nor mutually consistent, and that discrepancies among them can substantially influence research conclusions. We also observe that vendor-customized APIs are actively used by normal apps, yet remain largely overlooked by existing studies. Based on these findings, we discuss their implications for Android API-based research and provide actionable suggestions to help researchers select and interpret AALs more reliably.

SESep 17, 2024
Grounded GUI Understanding for Vision-Based Spatial Intelligent Agent: Exemplified by Extended Reality Apps

Shuqing Li, Binchang Li, Yepang Liu et al.

In recent years, spatial computing a.k.a. Extended Reality (XR) has emerged as a transformative technology, offering users immersive and interactive experiences across diversified virtual environments. Users can interact with XR apps through interactable GUI elements (IGEs) on the stereoscopic three-dimensional (3D) graphical user interface (GUI). The accurate recognition of these IGEs is instrumental, serving as the foundation of many software engineering tasks, including automated testing and effective GUI search. The most recent IGE detection approaches for 2D mobile apps typically train a supervised object detection model based on a large-scale manually-labeled GUI dataset, usually with a pre-defined set of clickable GUI element categories like buttons and spinners. Such approaches can hardly be applied to IGE detection in XR apps, due to a multitude of challenges including complexities posed by open-vocabulary and heterogeneous IGE categories, intricacies of context-sensitive interactability, and the necessities of precise spatial perception and visual-semantic alignment for accurate IGE detection results. Thus, it is necessary to embark on the IGE research tailored to XR apps. In this paper, we propose the first zero-shot cOntext-sensitive inteRactable GUI ElemeNT dEtection framework for virtual Reality apps, named Orienter. By imitating human behaviors, Orienter observes and understands the semantic contexts of XR app scenes first, before performing the detection. The detection process is iterated within a feedback-directed validation and reflection loop. Specifically, Orienter contains three components, including (1) Semantic context comprehension, (2) Reflection-directed IGE candidate detection, and (3) Context-sensitive interactability classification. Extensive experiments demonstrate that Orienter is more effective than the state-of-the-art GUI element detection approaches.

41.5CRApr 29
LLM-Powered Detection of Price Manipulation in DeFi

Lu Liu, Wuqi Zhang, Lili Wei et al.

Decentralized Finance (DeFi) smart contracts manage billions of dollars, making them a prime target for exploits. Price manipulation vulnerabilities, often via flash loans, are a devastating class of attacks causing significant financial losses. Existing detection methods are limited. Reactive approaches analyze attacks only after they occur, while proactive static analysis tools rely on rigid, predefined heuristics, limiting adaptability. Both depend on known attack patterns, failing to identify novel variants or comprehend complex economic logic. We propose PMDetector, a hybrid framework combining static analysis with Large Language Model (LLM)-based reasoning to proactively detect price manipulation vulnerabilities. Our approach uses a formal attack model and a three-stage pipeline. First, static taint analysis identifies potentially vulnerable code paths. Second, a two-stage LLM process filters paths by analyzing defenses and then simulates attacks to evaluate exploitability. Finally, a static analysis checker validates LLM results, retaining only high-risk paths and generating comprehensive vulnerability reports. To evaluate its effectiveness, we built a dataset of 73 real-world vulnerable and 288 benign DeFi protocols. Results show PMDetector achieves 88% precision and 90% recall with Gemini 2.5-flash, significantly outperforming state-of-the-art static analysis and LLM-based approaches. Auditing a vulnerability with PMDetector costs just $0.03 and takes 4.0 seconds with GPT-4.1, offering an efficient and cost-effective alternative to manual audits.

SEDec 19, 2025
Fairness Is Not Just Ethical: Performance Trade-Off via Data Correlation Tuning to Mitigate Bias in ML Software

Ying Xiao, Shangwen Wang, Sicen Liu et al.

Traditional software fairness research typically emphasizes ethical and social imperatives, neglecting that fairness fundamentally represents a core software quality issue arising directly from performance disparities across sensitive user groups. Recognizing fairness explicitly as a software quality dimension yields practical benefits beyond ethical considerations, notably improved predictive performance for unprivileged groups, enhanced out-of-distribution generalization, and increased geographic transferability in real-world deployments. Nevertheless, existing bias mitigation methods face a critical dilemma: while pre-processing methods offer broad applicability across model types, they generally fall short in effectiveness compared to post-processing techniques. To overcome this challenge, we propose Correlation Tuning (CoT), a novel pre-processing approach designed to mitigate bias by adjusting data correlations. Specifically, CoT introduces the Phi-coefficient, an intuitive correlation measure, to systematically quantify correlation between sensitive attributes and labels, and employs multi-objective optimization to address the proxy biases. Extensive evaluations demonstrate that CoT increases the true positive rate of unprivileged groups by an average of 17.5% and reduces three key bias metrics, including statistical parity difference (SPD), average odds difference (AOD), and equal opportunity difference (EOD), by more than 50% on average. CoT outperforms state-of-the-art methods by three and ten percentage points in single attribute and multiple attributes scenarios, respectively. We will publicly release our experimental results and source code to facilitate future research.

49.8SEMar 30
PCREQ: Automated Inference of Compatible Requirements for Python Third-party Library Upgrades

Huashan Lei, Guanping Xiao, Yepang Liu et al.

Python third-party libraries (TPLs) are essential in modern software development, but upgrades often cause compatibility issues, leading to system failures. These issues fall into two categories: version compatibility issues (VCIs) and code compatibility issues (CCIs). Existing tools mainly detect dependency conflicts but overlook code-level incompatibilities, with no solution fully automating the inference of compatible versions for both VCIs and CCIs. To fill this gap, we propose PCREQ, the first approach to automatically infer compatible requirements by combining version and code compatibility analysis. PCREQ integrates six modules: knowledge acquisition, version compatibility assessment, invoked APIs and modules extraction, code compatibility assessment, version change, and missing TPL completion. PCREQ collects candidate versions, checks for conflicts, identifies API usage, evaluates code compatibility, and iteratively adjusts versions to generate a compatible requirements.txt with a detailed repair report. To evaluate PCREQ, we construct REQBench, a real-world benchmark with 2,095 upgrade scenarios derived from 34 real-world scientific/ML Python projects. Results show PCREQ achieves a 94.03% inference success rate, outperforming PyEGo (37.02%), ReadPyE (37.16%), and LLM-based approaches (GPT-4o, DeepSeek V3/R1) by 18--22%. PCREQ processes each scenario from REQBench in 60.79 s on average, demonstrating practical efficiency. PCREQ reduces manual effort in troubleshooting upgrades, advancing Python dependency maintenance automation.

AIMay 26, 2025Code
AMQA: An Adversarial Dataset for Benchmarking Bias of LLMs in Medicine and Healthcare

Ying Xiao, Jie Huang, Ruijuan He et al.

Large language models (LLMs) are reaching expert-level accuracy on medical diagnosis questions, yet their mistakes and the biases behind them pose life-critical risks. Bias linked to race, sex, and socioeconomic status is already well known, but a consistent and automatic testbed for measuring it is missing. To fill this gap, this paper presents AMQA -- an Adversarial Medical Question-Answering dataset -- built for automated, large-scale bias evaluation of LLMs in medical QA. AMQA includes 4,806 medical QA pairs sourced from the United States Medical Licensing Examination (USMLE) dataset, generated using a multi-agent framework to create diverse adversarial descriptions and question pairs. Using AMQA, we benchmark five representative LLMs and find surprisingly substantial disparities: even GPT-4.1, the least biased model tested, answers privileged-group questions over 10 percentage points more accurately than unprivileged ones. Compared with the existing benchmark CPV, AMQA reveals 15% larger accuracy gaps on average between privileged and unprivileged groups. Our dataset and code are publicly available at https://github.com/XY-Showing/AMQA to support reproducible research and advance trustworthy, bias-aware medical AI.

SEJan 29, 2022Code
Aper: Evolution-Aware Runtime Permission Misuse Detection for Android Apps

Sinan Wang, Yibo Wang, Xian Zhan et al.

The Android platform introduces the runtime permission model in version 6.0. The new model greatly improves data privacy and user experience, but brings new challenges for app developers. First, it allows users to freely revoke granted permissions. Hence, developers cannot assume that the permissions granted to an app would keep being granted. Instead, they should make their apps carefully check the permission status before invoking dangerous APIs. Second, the permission specification keeps evolving, bringing new types of compatibility issues into the ecosystem. To understand the impact of the challenges, we conducted an empirical study on 13,352 popular Google Play apps. We found that 86.0% apps used dangerous APIs asynchronously after permission management and 61.2% apps used evolving dangerous APIs. If an app does not properly handle permission revocations or platform differences, unexpected runtime issues may happen and even cause app crashes. We call such Android Runtime Permission issues as ARP bugs. Unfortunately, existing runtime permission issue detection tools cannot effectively deal with the ARP bugs induced by asynchronous permission management and permission specification evolution. To fill the gap, we designed a static analyzer, Aper, that performs reaching definition and dominator analysis on Android apps to detect the two types of ARP bugs. To compare Aper with existing tools, we built a benchmark, ARPfix, from 60 real ARP bugs. Our experiment results show that Aper significantly outperforms two academic tools, ARPDroid and RevDroid, and an industrial tool, Lint, on ARPfix, with an average improvement of 46.3% on F1-score. In addition, Aper successfully found 34 ARP bugs in 214 opensource Android apps, most of which can result in abnormal app behaviors (such as app crashes) according to our manual validation.

SENov 10, 2021Code
A Comprehensive Evaluation of Android ICC Resolution Techniques

Jiwei Yan, Shixin Zhang, Yepang Liu et al.

Inter-component communication (ICC) is a widely used mechanism in mobile apps, which enables message-based control flow transferring and data passing between Android components. Effective ICC resolution requires precisely identifying entry points, analyzing data values of ICC fields, modeling related framework APIs, etc. Due to various control-flow- and data-flow-related characteristics involved and the lack of oracles for real-world apps, the comprehensive evaluation of ICC resolution techniques is challenging. To fill this gap, we collect multiple-type benchmark suites with 4,104 apps, covering hand-made apps, open-source, and commercial ones. Considering their differences, various evaluation metrics, e.g., number count, graph structure, and reliable oracle based metrics, are adopted on-demand. As the oracle for real-world apps is unavailable, we design a dynamic analysis approach to extract the real ICC links triggered during GUI exploration. By auditing the code implementations, we carefully check the extracted ICCs and confirm 1,680 ones to form a reliable oracle set, in which each ICC is labeled with 25 code characteristic tags. The evaluation performed on six state-of-the-art ICC resolution tools shows that 1) the completeness of static ICC resolution results on real-world apps is not satisfactory, as up to 38%-85% ICCs are missed by tools; 2) many wrongly reported ICCs are sent from or received by only a few components and the graph structure information can help the identification; 3) the efficiency of fundamental tools, like ICC resolution ones, should be optimized in both engineering and research aspects. By investigating both the missed and wrongly reported ICCs, we discuss the strengths of different tools for users and summarize eight common FN/FP patterns in ICC resolution for tool developers.

SEAug 4, 2021Code
A Systematic Assessment on Android Third-party Library Detection Tools

Xian Zhan, Tianming Liu, Yepang Liu et al.

Third-party libraries (TPLs) have become a significant part of the Android ecosystem. Developers can employ various TPLs to facilitate their app development. Unfortunately, the popularity of TPLs also brings new security issues. For example, TPLs may carry malicious or vulnerable code, which can infect popular apps to pose threats to mobile users. Furthermore, TPL detection is essential for downstream tasks, such as vulnerabilities and malware detection. Thus, various tools have been developed to identify TPLs. However, no existing work has studied these TPL detection tools in detail, and different tools focus on different applications and techniques with performance differences. A comprehensive understanding of these tools will help us make better use of them. To this end, we conduct a comprehensive empirical study to fill the gap by evaluating and comparing all publicly available TPL detection tools based on six criteria: accuracy of TPL construction, effectiveness, efficiency, accuracy of version identification, resiliency to code obfuscation, and ease of use. Besides, we enhance these open-source tools by fixing their limitations, to improve their detection ability. Finally, we build an extensible framework that integrates all existing available TPL detection tools, providing an online service for the research community. We release the evaluation dataset and enhanced tools. According to our study, we also present the essential findings and discuss promising implications to the community. We believe our work provides a clear picture of existing TPL detection techniques and also gives a roadmap for future research.

SEJun 24, 2021Code
Runtime Permission Issues in Android Apps: Taxonomy, Practices, and Ways Forward

Ying Wang, Yibo Wang, Sinan Wang et al.

Android introduces a new permission model that allows apps to request permissions at runtime rather than at the installation time since 6.0 (Marshmallow, API level 23). While this runtime permission model provides users with greater flexibility in controlling an app's access to sensitive data and system features, it brings new challenges to app development. First, as users may grant or revoke permissions at any time while they are using an app, developers need to ensure that the app properly checks and requests required permissions before invoking any permission-protected APIs. Second, Android's permission mechanism keeps evolving and getting customized by device manufacturers. Developers are expected to comprehensively test their apps on different Android versions and device models to make sure permissions are properly requested in all situations. Unfortunately, these requirements are often impractical for developers. In practice, many Android apps suffer from various runtime permission issues (ARP issues). While existing studies have explored ARP issues, the understanding of such issues is still preliminary. To better characterize ARP issues, we performed an empirical study using 135 Stack Overflow posts that discuss ARP issues and 199 real ARP issues archived in popular open-source Android projects on GitHub. Via analyzing the data, we observed 11 types of ARP issues that commonly occur in Android apps. Furthermore, we conducted a field survey and in-depth interviews among practitioners, to gain insights from industrial practices and learn practitioners' requirements of tools that can help combat ARP issues. We hope that our findings can shed light on future research and provide useful guidance to practitioners.

SEJun 13, 2020Code
Will Dependency Conflicts Affect My Program's Semantics?

Ying Wang, Rongxin Wu, Chao Wang et al.

Java projects are often built on top of various third-party libraries. If multiple versions of a library exist on the classpath, JVM will only load one version and shadow the others, which we refer to as dependency conflicts. This would give rise to semantic conflict (SC) issues, if the library APIs referenced by a project have identical method signatures but inconsistent semantics across the loaded and shadowed versions of libraries. SC issues are difficult for developers to diagnose in practice, since understanding them typically requires domain knowledge. Although adapting the existing test generation technique for dependency conflict issues, Riddle, to detect SC issues is feasible, its effectiveness is greatly compromised. This is mainly because Riddle randomly generates test inputs, while the SC issues typically require specific arguments in the tests to be exposed. To address that, we conducted an empirical study of 75 real SC issues to understand the characteristics of such specific arguments in the test cases that can capture the SC issues. Inspired by our empirical findings, we propose an automated testing technique Sensor, which synthesizes test cases using ingredients from the project under test to trigger inconsistent behaviors of the APIs with the same signatures in conflicting library versions. Our evaluation results show that \textsc{Sensor} is effective and useful: it achieved a $Precision$ of 0.803 and a $Recall$ of 0.760 on open-source projects and a $Precision$ of 0.821 on industrial projects; it detected 150 semantic conflict issues in 29 projects, 81.8\% of which had been confirmed as real bugs.

SENov 24, 2016Code
DroidLeaks: Benchmarking Resource Leak Bugs for Android Applications

Yepang Liu, Lili Wei, Chang Xu et al.

Resource leak bugs in Android apps are pervasive and can cause serious performance degradation and system crashes. In recent years, several resource leak detection techniques have been proposed to assist Android developers in correctly managing system resources. Yet, there exist no common bug benchmarks for effectively and reliably comparing such techniques and quantitatively evaluating their strengths and weaknesses. This paper describes our initial contribution towards constructing such a benchmark. To locate real resource leak bugs, we mined 124,215 code revisions of 34 large-scale open-source Android apps. We successfully found 298 fixed resource leaks, which cover a diverse set of resource classes, from 32 out of the 34 apps. To understand the characteristics of these bugs, we conducted an empirical study, which revealed the root causes of frequent resource leaks in Android apps and common patterns of faults made by developers. With our findings, we further implemented a static checker to detect a common pattern of resource leaks in Android apps. Experiments showed that the checker can effectively locate real resource leaks in popular Android apps, confirming the usefulness of our work.

78.9LOApr 27
Understanding and Improving Automated Proof Synthesis for Interactive Theorem Provers

Manqing Zhang, Yunwei Dong, Lingru Zhou et al.

Formal verification using interactive theorem provers ensures high-quality software. However, writing proof scripts for interactive theorem provers is labor-intensive and requires deep expertise. Recent studies have leveraged deep learning to automate theorem proving by learning from manually written proof corpora. Nevertheless, these techniques still achieve limited success rates in proof synthesis. This paper investigates the theorems that current proof synthesis techniques are unable to prove and analyzes their characteristics. Through an in-depth analysis of the proven theorems, proof scripts, and the proof search process, we identify the limitations of existing tools and summarize the key factors influencing proof success rates. Our research provides valuable insights for the future optimization of automated proof tools. Based on our empirical study, we discover that tactic selections conforming to human expert proof patterns are more likely to lead to successful proofs. Inspired by this finding, we propose a pattern-guided tactic search (PGTS) method to heuristically enhance the search process of existing proof synthesis tools. Our evaluation experiments demonstrate that PGTS improves the number of theorems proved by existing proof synthesis tools by an average of 8.05\%, while also achieving an average 20\% increase in previously unproven theorems. Furthermore, PGTS enhances the capability of proof synthesis tools to prove complex theorems and generates more concise proof scripts.

CVJun 8, 2025
Backdoor Attack on Vision Language Models with Stealthy Semantic Manipulation

Zhiyuan Zhong, Zhen Sun, Yepang Liu et al.

Vision Language Models (VLMs) have shown remarkable performance, but are also vulnerable to backdoor attacks whereby the adversary can manipulate the model's outputs through hidden triggers. Prior attacks primarily rely on single-modality triggers, leaving the crucial cross-modal fusion nature of VLMs largely unexplored. Unlike prior work, we identify a novel attack surface that leverages cross-modal semantic mismatches as implicit triggers. Based on this insight, we propose BadSem (Backdoor Attack with Semantic Manipulation), a data poisoning attack that injects stealthy backdoors by deliberately misaligning image-text pairs during training. To perform the attack, we construct SIMBad, a dataset tailored for semantic manipulation involving color and object attributes. Extensive experiments across four widely used VLMs show that BadSem achieves over 98% average ASR, generalizes well to out-of-distribution datasets, and can transfer across poisoning modalities. Our detailed analysis using attention visualization shows that backdoored models focus on semantically sensitive regions under mismatched conditions while maintaining normal behavior on clean inputs. To mitigate the attack, we try two defense strategies based on system prompt and supervised fine-tuning but find that both of them fail to mitigate the semantic backdoor. Our findings highlight the urgent need to address semantic vulnerabilities in VLMs for their safer deployment.

SEOct 6, 2025
Retrieval-Augmented Code Generation: A Survey with Focus on Repository-Level Approaches

Yicheng Tao, Yao Qin, Yepang Liu

Recent advancements in large language models (LLMs) have substantially improved automated code generation. While function-level and file-level generation have achieved promising results, real-world software development typically requires reasoning across entire repositories. This gives rise to the challenging task of Repository-Level Code Generation (RLCG), where models must capture long-range dependencies, ensure global semantic consistency, and generate coherent code spanning multiple files or modules. To address these challenges, Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm that integrates external retrieval mechanisms with LLMs, enhancing context-awareness and scalability. In this survey, we provide a comprehensive review of research on Retrieval-Augmented Code Generation (RACG), with an emphasis on repository-level approaches. We categorize existing work along several dimensions, including generation strategies, retrieval modalities, model architectures, training paradigms, and evaluation protocols. Furthermore, we summarize widely used datasets and benchmarks, analyze current limitations, and outline key challenges and opportunities for future research. Our goal is to establish a unified analytical framework for understanding this rapidly evolving field and to inspire continued progress in AI-powered software engineering.

CLOct 28, 2025
ComboBench: Can LLMs Manipulate Physical Devices to Play Virtual Reality Games?

Shuqing Li, Jiayi Yan, Chenyu Niu et al. · pku, tencent-ai

Virtual Reality (VR) games require players to translate high-level semantic actions into precise device manipulations using controllers and head-mounted displays (HMDs). While humans intuitively perform this translation based on common sense and embodied understanding, whether Large Language Models (LLMs) can effectively replicate this ability remains underexplored. This paper introduces a benchmark, ComboBench, evaluating LLMs' capability to translate semantic actions into VR device manipulation sequences across 262 scenarios from four popular VR games: Half-Life: Alyx, Into the Radius, Moss: Book II, and Vivecraft. We evaluate seven LLMs, including GPT-3.5, GPT-4, GPT-4o, Gemini-1.5-Pro, LLaMA-3-8B, Mixtral-8x7B, and GLM-4-Flash, compared against annotated ground truth and human performance. Our results reveal that while top-performing models like Gemini-1.5-Pro demonstrate strong task decomposition capabilities, they still struggle with procedural reasoning and spatial understanding compared to humans. Performance varies significantly across games, suggesting sensitivity to interaction complexity. Few-shot examples substantially improve performance, indicating potential for targeted enhancement of LLMs' VR manipulation capabilities. We release all materials at https://sites.google.com/view/combobench.

SEJun 13, 2024
Less Cybersickness, Please: Demystifying and Detecting Stereoscopic Visual Inconsistencies in Virtual Reality Apps

Shuqing Li, Cuiyun Gao, Jianping Zhang et al.

The quality of Virtual Reality (VR) apps is vital, particularly the rendering quality of the VR Graphical User Interface (GUI). Different from traditional 2D apps, VR apps create a 3D digital scene for users, by rendering two distinct 2D images for the user's left and right eyes, respectively. Stereoscopic visual inconsistency (denoted as "SVI") issues, however, undermine the rendering process of the user's brain, leading to user discomfort and even adverse health effects. Such issues commonly exist but remain underexplored. We conduct an empirical analysis on 282 SVI bug reports from 15 VR platforms, summarizing 15 types of manifestations. The empirical analysis reveals that automatically detecting SVI issues is challenging, mainly because: (1) lack of training data; (2) the manifestations of SVI issues are diverse, complicated, and often application-specific; (3) most accessible VR apps are closed-source commercial software. Existing pattern-based supervised classification approaches may be inapplicable or ineffective in detecting the SVI issues. To counter these challenges, we propose an unsupervised black-box testing framework named StereoID to identify the stereoscopic visual inconsistencies, based only on the rendered GUI states. StereoID generates a synthetic right-eye image based on the actual left-eye image and computes distances between the synthetic right-eye image and the actual right-eye image to detect SVI issues. We propose a depth-aware conditional stereo image translator to power the image generation process, which captures the expected perspective shifts between left-eye and right-eye images. We build a large-scale unlabeled VR stereo screenshot dataset with larger than 171K images from 288 real-world VR apps for experiments. After substantial experiments, StereoID demonstrates superior performance for detecting SVI issues in both user reports and wild VR apps.

LGMay 23, 2023
FITNESS: A Causal De-correlation Approach for Mitigating Bias in Machine Learning Software

Ying Xiao, Shangwen Wang, Sicen Liu et al.

Software built on top of machine learning algorithms is becoming increasingly prevalent in a variety of fields, including college admissions, healthcare, insurance, and justice. The effectiveness and efficiency of these systems heavily depend on the quality of the training datasets. Biased datasets can lead to unfair and potentially harmful outcomes, particularly in such critical decision-making systems where the allocation of resources may be affected. This can exacerbate discrimination against certain groups and cause significant social disruption. To mitigate such unfairness, a series of bias-mitigating methods are proposed. Generally, these studies improve the fairness of the trained models to a certain degree but with the expense of sacrificing the model performance. In this paper, we propose FITNESS, a bias mitigation approach via de-correlating the causal effects between sensitive features (e.g., the sex) and the label. Our key idea is that by de-correlating such effects from a causality perspective, the model would avoid making predictions based on sensitive features and thus fairness could be improved. Furthermore, FITNESS leverages multi-objective optimization to achieve a better performance-fairness trade-off. To evaluate the effectiveness, we compare FITNESS with 7 state-of-the-art methods in 8 benchmark tasks by multiple metrics. Results show that FITNESS can outperform the state-of-the-art methods on bias mitigation while preserve the model's performance: it improved the model's fairness under all the scenarios while decreased the model's performance under only 26.67% of the scenarios. Additionally, FITNESS surpasses the Fairea Baseline in 96.72% cases, outperforming all methods we compared.

SESep 1, 2021
Characterizing and Detecting Configuration Compatibility Issues in Android Apps

Huaxun Huang, Ming Wen, Lili Wei et al.

XML configuration files are widely used in Android to define an app's user interface and essential runtime information such as system permissions. As Android evolves, it might introduce functional changes in the configuration environment, thus causing compatibility issues that manifest as inconsistent app behaviors at different API levels. Such issues can often induce software crashes and inconsistent look-and-feel when running at specific Android versions. Existing works incur plenty of false positive and false negative issue-detection rules by conducting trivial data-flow analysis while failing to model the XML tree hierarchies of the Android configuration files. Besides, little is known about how the changes in an Android framework can induce such compatibility issues. To bridge such gaps, we conducted a systematic study by analyzing 196 real-world issues collected from 43 popular apps. We identified common patterns of Android framework code changes that induce such configuration compatibility issues. Based on the findings, we propose \textsc{ConfDroid} that can automatically extract rules for detecting configuration compatibility issues. The intuition is to perform symbolic execution based on a model learned from the common code change patterns. Experiment results show that ConfDroid can successfully extract 282 valid issue-detection rules with a precision of 91.9%. Among them, 65 extracted rules can manifest issues that cannot be detected by the rules of state-of-the-art baselines. More importantly, 11 out of them have led to the detection of 107 reproducible configuration compatibility issues that the baselines cannot detect in 30 out of 316 real-world Android apps.

CRAug 24, 2021
Characterizing Transaction-Reverting Statements in Ethereum Smart Contracts

Lu Liu, Lili Wei, Wuqi Zhang et al.

Smart contracts are programs running on blockchain to execute transactions. When input constraints or security properties are violated at runtime, the transaction being executed by a smart contract needs to be reverted to avoid undesirable consequences. On Ethereum, the most popular blockchain that supports smart contracts, developers can choose among three transaction-reverting statements (i.e., require, if...revert, and if...throw) to handle anomalous transactions. While these transaction-reverting statements are vital for preventing smart contracts from exhibiting abnormal behaviors or suffering malicious attacks, there is limited understanding of how they are used in practice. In this work, we perform the first empirical study to characterize transaction-reverting statements in Ethereum smart contracts. We measured the prevalence of these statements in 3,866 verified smart contracts from popular dapps and built a taxonomy of their purposes via manually analyzing 557 transaction-reverting statements. We also compared template contracts and their corresponding custom contracts to understand how developers customize the use of transaction-reverting statements. Finally, we analyzed the security impact of transaction-reverting statements by removing them from smart contracts and comparing the mutated contracts against the original ones. Our study led to important findings, which can shed light on further research in the broad area of smart contract quality assurance and provide practical guidance to smart contract developers on the appropriate use of transaction-reverting statements.

SEJun 17, 2021
ÐArcher: Detecting On-Chain-Off-Chain Synchronization Bugs in Decentralized Applications

Wuqi Zhang, Lili Wei, Shuqing Li et al.

Since the emergence of Ethereum, blockchain-based decentralized applications (DApps) have become increasingly popular and important. To balance the security, performance, and costs, a DApp typically consists of two layers: an on-chain layer to execute transactions and store crucial data on the blockchain and an off-chain layer to interact with users. A DApp needs to synchronize its off-chain layer with the on-chain layer proactively. Otherwise, the inconsistent data in the off-chain layer could mislead users and cause undesirable consequences, e.g., loss of transaction fees. However, transactions sent to the blockchain are not guaranteed to be executed and could even be reversed after execution due to chain reorganization. Such non-determinism in the transaction execution is unique to blockchain. DApp developers may fail to perform the on-chain-off-chain synchronization accurately due to their lack of familiarity with the complex transaction lifecycle. In this work, we investigate the challenges of synchronizing on-chain and off-chain data in Ethereum-based DApps. We present two types of bugs that could result in inconsistencies between the on-chain and off-chain layers. To help detect such on-chain-off-chain synchronization bugs, we introduce a state transition model to guide the testing of DApps and propose two effective oracles to facilitate the automatic identification of bugs. We build the first testing framework, DArcher, to detect on-chain-off-chain synchronization bugs in DApps. We have evaluated DArcher on 11 popular real-world DApps. DArcher achieves high precision (99.3%), recall (87.6%), and accuracy (89.4%) in bug detection and significantly outperforms the baseline methods. It has found 15 real bugs in the 11 DApps. So far, six of the 15 bugs have been confirmed by the developers, and three have been fixed. These promising results demonstrate the usefulness of DArcher.

SEMar 10, 2021
Automatic Web Testing using Curiosity-Driven Reinforcement Learning

Yan Zheng, Yi Liu, Xiaofei Xie et al.

Web testing has long been recognized as a notoriously difficult task. Even nowadays, web testing still heavily relies on manual efforts while automated web testing is far from achieving human-level performance. Key challenges in web testing include dynamic content update and deep bugs hiding under complicated user interactions and specific input values, which can only be triggered by certain action sequences in the huge search space. In this paper, we propose WebExplor, an automatic end-to-end web testing framework, to achieve an adaptive exploration of web applications. WebExplor adopts curiosity-driven reinforcement learning to generate high-quality action sequences (test cases) satisfying temporal logical relations. Besides, WebExplor incrementally builds an automaton during the online testing process, which provides high-level guidance to further improve the testing efficiency. We have conducted comprehensive evaluations of WebExplor on six real-world projects, a commercial SaaS web application, and performed an in-the-wild study of the top 50 web applications in the world. The results demonstrate that in most cases WebExplor can achieve a significantly higher failure detection rate, code coverage, and efficiency than existing state-of-the-art web testing techniques. WebExplor also detected 12 previously unknown failures in the commercial web application, which have been confirmed and fixed by the developers. Furthermore, our in-the-wild study further uncovered 3,466 exceptions and errors.

CRMar 9, 2021
Deep Learning for Android Malware Defenses: a Systematic Literature Review

Yue Liu, Chakkrit Tantithamthavorn, Li Li et al.

Malicious applications (particularly those targeting the Android platform) pose a serious threat to developers and end-users. Numerous research efforts have been devoted to developing effective approaches to defend against Android malware. However, given the explosive growth of Android malware and the continuous advancement of malicious evasion technologies like obfuscation and reflection, Android malware defense approaches based on manual rules or traditional machine learning may not be effective. In recent years, a dominant research field called deep learning (DL), which provides a powerful feature abstraction ability, has demonstrated a compelling and promising performance in a variety of areas, like natural language processing and computer vision. To this end, employing deep learning techniques to thwart Android malware attacks has recently garnered considerable research attention. Yet, no systematic literature review focusing on deep learning approaches for Android Malware defenses exists. In this paper, we conducted a systematic literature review to search and analyze how deep learning approaches have been applied in the context of malware defenses in the Android environment. As a result, a total of 132 studies covering the period 2014-2021 were identified. Our investigation reveals that, while the majority of these sources mainly consider DL-based on Android malware detection, 53 primary studies (40.1 percent) design defense approaches based on other scenarios. This review also discusses research trends, research focuses, challenges, and future research directions in DL-based Android malware defenses.

SEFeb 24, 2021
Hero: On the Chaos When PATH Meets Modules

Ying Wang, Liang Qiao, Chang Xu et al.

Ever since its first release in 2009, the Go programming language (Golang) has been well received by software communities. A major reason for its success is the powerful support of library-based development, where a Golang project can be conveniently built on top of other projects by referencing them as libraries. As Golang evolves, it recommends the use of a new library-referencing mode to overcome the limitations of the original one. While these two library modes are incompatible, both are supported by the Golang ecosystem. The heterogeneous use of library-referencing modes across Golang projects has caused numerous dependency management (DM) issues, incurring reference inconsistencies and even build failures. Motivated by the problem, we conducted an empirical study to characterize the DM issues, understand their root causes, and examine their fixing solutions. Based on our findings, we developed \textsc{Hero}, an automated technique to detect DM issues and suggest proper fixing solutions. We applied \textsc{Hero} to 19,000 popular Golang projects. The results showed that \textsc{Hero} achieved a high detection rate of 98.5\% on a DM issue benchmark and found 2,422 new DM issues in 2,356 popular Golang projects. We reported 280 issues, among which 181 (64.6\%) issues have been confirmed, and 160 of them (88.4\%) have been fixed or are under fixing. Almost all the fixes have adopted our fixing suggestions.

LGSep 6, 2019
Testing Deep Learning Models for Image Analysis Using Object-Relevant Metamorphic Relations

Yongqiang Tian, Shiqing Ma, Ming Wen et al.

Deep learning models are widely used for image analysis. While they offer high performance in terms of accuracy, people are concerned about if these models inappropriately make inferences using irrelevant features that are not encoded from the target object in a given image. To address the concern, we propose a metamorphic testing approach that assesses if a given inference is made based on irrelevant features. Specifically, we propose two novel metamorphic relations to detect such inappropriate inferences. We applied our approach to 10 image classification models and 10 object detection models, with three large datasets, i.e., ImageNet, COCO, and Pascal VOC. Over 5.3% of the top-5 correct predictions made by the image classification models are subject to inappropriate inferences using irrelevant features. The corresponding rate for the object detection models is over 8.5%. Based on the findings, we further designed a new image generation strategy that can effectively attack existing models. Comparing with a baseline approach, our strategy can double the success rate of attacks.

SEJan 4, 2019
Detecting and Diagnosing Energy Issues for Mobile Applications

Xueliang Li, Yuming Yang, Yepang liu et al.

Energy efficiency is an important criterion to judge the quality of mobile apps, but one third of our randomly sampled apps suffer from energy issues that can quickly drain battery power. To understand these issues, we conducted an empirical study on 27 well-maintained apps such as Chrome and Firefox, whose issue tracking systems are publicly accessible. Our study revealed that the main root causes of energy issues include unnecessary workload and excessively frequent operations. Surprisingly, these issues are beyond the application of present technology on energy issue detection. We also found that 20.6% of energy issues can only manifest themselves under specific contexts such as poor network performance, but such contexts are again neglected by present technology. Therefore, we proposed a novel testing framework for detecting energy issues in real-world apps. Our framework examines apps with well-designed input sequences and runtime contexts. To identify the root causes mentioned above, we employed a machine learning algorithm to cluster the workloads and further evaluate their necessity. For the issues concealed by the specific contexts, we carefully set up several execution contexts to pinpoint them. More importantly, we developed leading edge technology, e.g. pre-designing input sequences with potential energy overuse and tuning tests on-the-fly, to achieve high efficacy in detecting energy issues. A large-scale evaluation shows that 91.6% issues detected in our test were previously unknown to developers. On average, these issues double the energy costs of the apps. Furthermore, our test achieves a low number of false positives. Finally, we show how our test reports can help developers fix the issues.