Lili Wei

SE
h-index18
12papers
56citations
Novelty40%
AI Score52

12 Papers

SEApr 16Code
Towards Understanding Android APIs: Official Lists, Vendor Customizations, and Real-World Usage

Sinan Wang, Qi Zhang, Jiacheng Li et al.

Android apps are built on APIs that abstract core Android system functionalities. These APIs are officially documented in multiple files distributed with the Android source code or SDK, which we collectively refer to as Android API Lists (AALs). Prior Android research has relied on specific AALs, often treating them as interchangeable ground truth. However, recent studies suggest that different AALs can lead to substantially different research outcomes, raising concerns about the validity and reproducibility of Android API-based analyses. To address this issue, we present the first in-depth empirical study of four official AALs that are widely used in prior work. We systematically characterize their contents and analyze their evolution across Android releases. We then perform a fine-grained comparison of the APIs recorded in each AAL to uncover their underlying API inclusion policies and inconsistencies. To assess the practical impact of these differences, we further examine API availability on nine Android devices, including both stock Android and vendor-customized systems. Finally, we analyze API usage in 17,759 real-world Android apps (including open-source apps, commercial apps, and malware) to quantify how the choice of AAL affects empirical Android research. Our results reveal that official AALs are neither stable nor mutually consistent, and that discrepancies among them can substantially influence research conclusions. We also observe that vendor-customized APIs are actively used by normal apps, yet remain largely overlooked by existing studies. Based on these findings, we discuss their implications for Android API-based research and provide actionable suggestions to help researchers select and interpret AALs more reliably.

CRApr 29
LLM-Powered Detection of Price Manipulation in DeFi

Lu Liu, Wuqi Zhang, Lili Wei et al.

Decentralized Finance (DeFi) smart contracts manage billions of dollars, making them a prime target for exploits. Price manipulation vulnerabilities, often via flash loans, are a devastating class of attacks causing significant financial losses. Existing detection methods are limited. Reactive approaches analyze attacks only after they occur, while proactive static analysis tools rely on rigid, predefined heuristics, limiting adaptability. Both depend on known attack patterns, failing to identify novel variants or comprehend complex economic logic. We propose PMDetector, a hybrid framework combining static analysis with Large Language Model (LLM)-based reasoning to proactively detect price manipulation vulnerabilities. Our approach uses a formal attack model and a three-stage pipeline. First, static taint analysis identifies potentially vulnerable code paths. Second, a two-stage LLM process filters paths by analyzing defenses and then simulates attacks to evaluate exploitability. Finally, a static analysis checker validates LLM results, retaining only high-risk paths and generating comprehensive vulnerability reports. To evaluate its effectiveness, we built a dataset of 73 real-world vulnerable and 288 benign DeFi protocols. Results show PMDetector achieves 88% precision and 90% recall with Gemini 2.5-flash, significantly outperforming state-of-the-art static analysis and LLM-based approaches. Auditing a vulnerability with PMDetector costs just $0.03 and takes 4.0 seconds with GPT-4.1, offering an efficient and cost-effective alternative to manual audits.

CRDec 17, 2025
SeBERTis: A Framework for Producing Classifiers of Security-Related Issue Reports

Sogol Masoumzadeh, Yufei Li, Shane McIntosh et al.

Monitoring issue tracker submissions is a crucial software maintenance activity. A key goal is the prioritization of high risk, security-related bugs. If such bugs can be recognized early, the risk of propagation to dependent products and endangerment of stakeholder benefits can be mitigated. To assist triage engineers with this task, several automatic detection techniques, from Machine Learning (ML) models to prompting Large Language Models (LLMs), have been proposed. Although promising to some extent, prior techniques often memorize lexical cues as decision shortcuts, yielding low detection rate specifically for more complex submissions. As such, these classifiers do not yet reach the practical expectations of a real-time detector of security-related issues. To address these limitations, we propose SEBERTIS, a framework to train Deep Neural Networks (DNNs) as classifiers independent of lexical cues, so that they can confidently detect fully unseen security-related issues. SEBERTIS capitalizes on fine-tuning bidirectional transformer architectures as Masked Language Models (MLMs) on a series of semantically equivalent vocabulary to prediction labels (which we call Semantic Surrogates) when they have been replaced with a mask. Our SEBERTIS-trained classifier achieves a 0.9880 F1-score in detecting security-related issues of a curated corpus of 10,000 GitHub issue reports, substantially outperforming state-of-the-art issue classifiers, with 14.44%-96.98%, 15.40%-93.07%, and 14.90%-94.72% higher detection precision, recall, and F1-score over ML-based baselines. Our classifier also substantially surpasses LLM baselines, with an improvement of 23.20%-63.71%, 36.68%-85.63%, and 39.49%-74.53% for precision, recall, and F1-score.

CVDec 9, 2025
Query-aware Hub Prototype Learning for Few-Shot 3D Point Cloud Semantic Segmentation

YiLin Zhou, Lili Wei, Zheming Xu et al.

Few-shot 3D point cloud semantic segmentation (FS-3DSeg) aims to segment novel classes with only a few labeled samples. However, existing metric-based prototype learning methods generate prototypes solely from the support set, without considering their relevance to query data. This often results in prototype bias, where prototypes overfit support-specific characteristics and fail to generalize to the query distribution, especially in the presence of distribution shifts, which leads to degraded segmentation performance. To address this issue, we propose a novel Query-aware Hub Prototype (QHP) learning method that explicitly models semantic correlations between support and query sets. Specifically, we propose a Hub Prototype Generation (HPG) module that constructs a bipartite graph connecting query and support points, identifies frequently linked support hubs, and generates query-relevant prototypes that better capture cross-set semantics. To further mitigate the influence of bad hubs and ambiguous prototypes near class boundaries, we introduce a Prototype Distribution Optimization (PDO) module, which employs a purity-reweighted contrastive loss to refine prototype representations by pulling bad hubs and outlier prototypes closer to their corresponding class centers. Extensive experiments on S3DIS and ScanNet demonstrate that QHP achieves substantial performance gains over state-of-the-art methods, effectively narrowing the semantic gap between prototypes and query sets in FS-3DSeg.

SEApr 6, 2021Code
Logging Practices with Mobile Analytics: An Empirical Study on Firebase

Julian Harty, Haonan Zhang, Lili Wei et al.

Software logs are of great value in both industrial and open-source projects. Mobile analytics logging enables developers to collect logs remotely from their apps running on end user devices at the cost of recording and transmitting logs across the Internet to a centralised infrastructure. This paper makes a first step in characterising logging practices with a widely adopted mobile analytics logging library, namely Firebase Analytics. We provide an empirical evaluation of the use of Firebase Analytics in 57 open-source Android applications by studying the evolution of code-bases to understand: a) the needs-in-common that push practitioners to adopt logging practices on mobile devices, and b) the differences in the ways developers use local and remote logging. Our results indicate mobile analytics logs are less pervasive and less maintained than traditional logging code. Based on our analysis, we believe logging using mobile analytics is more user centered compared to traditional logging, where the latter is mainly used to record information for debugging purposes.

SENov 24, 2016Code
DroidLeaks: Benchmarking Resource Leak Bugs for Android Applications

Yepang Liu, Lili Wei, Chang Xu et al.

Resource leak bugs in Android apps are pervasive and can cause serious performance degradation and system crashes. In recent years, several resource leak detection techniques have been proposed to assist Android developers in correctly managing system resources. Yet, there exist no common bug benchmarks for effectively and reliably comparing such techniques and quantitatively evaluating their strengths and weaknesses. This paper describes our initial contribution towards constructing such a benchmark. To locate real resource leak bugs, we mined 124,215 code revisions of 34 large-scale open-source Android apps. We successfully found 298 fixed resource leaks, which cover a diverse set of resource classes, from 32 out of the 34 apps. To understand the characteristics of these bugs, we conducted an empirical study, which revealed the root causes of frequent resource leaks in Android apps and common patterns of faults made by developers. With our findings, we further implemented a static checker to detect a common pattern of resource leaks in Android apps. Experiments showed that the checker can effectively locate real resource leaks in popular Android apps, confirming the usefulness of our work.

SEApr 9
MIMIC-Py: An Extensible Tool for Personality-Driven Automated Game Testing with Large Language Models

Yifei Chen, Sarra Habchi, Lili Wei

Modern video games are complex, non-deterministic systems that are difficult to test automatically at scale. Although prior work shows that personality-driven Large Language Model (LLM) agents can improve behavioural diversity and test coverage, existing tools largely remain research prototypes and lack cross-game reusability. This tool paper presents MIMIC-Py, a Python-based automated game-testing tool that transforms personality-driven LLM agents into a reusable and extensible framework. MIMIC-Py exposes personality traits as configurable inputs and adopts a modular architecture that decouples planning, execution, and memory from game-specific logic. It supports multiple interaction mechanisms, enabling agents to interact with games via exposed APIs or synthesized code. We describe the design of MIMIC-Py and show how it enables deployment to new game environments with minimal engineering effort, bridging the gap between research prototypes and practical automated game testing. The source code and a demo video are available on our project webpage: https://mimic-persona.github.io/MIMIC-Py-Home-Page/.

NIJul 28, 2025
Deep Reinforcement Learning-based Cell DTX/DRX Configuration for Network Energy Saving

Wei Mao, Lili Wei, Omid Semiari et al.

3GPP Release 18 cell discontinuous transmission and reception (cell DTX/DRX) is an important new network energy saving feature for 5G. As a time-domain technique, it periodically aggregates the user data transmissions in a given duration of time when the traffic load is not heavy, so that the remaining time can be kept silent and advanced sleep modes (ASM) can be enabled to shut down more radio components and save more energy for the cell. However, inevitably the packet delay is increased, as during the silent period no transmission is allowed. In this paper we study how to configure cell DTX/DRX to optimally balance energy saving and packet delay, so that for delay-sensitive traffic maximum energy saving can be achieved while the degradation of quality of service (QoS) is minimized. As the optimal configuration can be different for different network and traffic conditions, the problem is complex and we resort to deep reinforcement learning (DRL) framework to train an AI agent to solve it. Through careful design of 1) the learning algorithm, which implements a deep Q-network (DQN) on a contextual bandit (CB) model, and 2) the reward function, which utilizes a smooth approximation of a theoretically optimal but discontinuous reward function, we are able to train an AI agent that always tries to select the best possible Cell DTX/DRX configuration under any network and traffic conditions. Simulation results show that compared to the case when cell DTX/DRX is not used, our agent can achieve up to ~45% energy saving depending on the traffic load scenario, while always maintaining no more than ~1% QoS degradation.

SENov 17, 2025
Examining the Usage of Generative AI Models in Student Learning Activities for Software Programming

Rufeng Chen, Shuaishuai Jiang, Jiyun Shen et al.

The rise of Generative AI (GenAI) tools like ChatGPT has created new opportunities and challenges for computing education. Existing research has primarily focused on GenAI's ability to complete educational tasks and its impact on student performance, often overlooking its effects on knowledge gains. In this study, we investigate how GenAI assistance compares to conventional online resources in supporting knowledge gains across different proficiency levels. We conducted a controlled user experiment with 24 undergraduate students of two different levels of programming experience (beginner, intermediate) to examine how students interact with ChatGPT while solving programming tasks. We analyzed task performance, conceptual understanding, and interaction behaviors. Our findings reveal that generating complete solutions with GenAI significantly improves task performance, especially for beginners, but does not consistently result in knowledge gains. Importantly, usage strategies differ by experience: beginners tend to rely heavily on GenAI toward task completion often without knowledge gain in the process, while intermediates adopt more selective approaches. We find that both over-reliance and minimal use result in weaker knowledge gains overall. Based on our results, we call on students and educators to adopt GenAI as a learning rather than a problem solving tool. Our study highlights the urgent need for guidance when integrating GenAI into programming education to foster deeper understanding.

SESep 1, 2021
Characterizing and Detecting Configuration Compatibility Issues in Android Apps

Huaxun Huang, Ming Wen, Lili Wei et al.

XML configuration files are widely used in Android to define an app's user interface and essential runtime information such as system permissions. As Android evolves, it might introduce functional changes in the configuration environment, thus causing compatibility issues that manifest as inconsistent app behaviors at different API levels. Such issues can often induce software crashes and inconsistent look-and-feel when running at specific Android versions. Existing works incur plenty of false positive and false negative issue-detection rules by conducting trivial data-flow analysis while failing to model the XML tree hierarchies of the Android configuration files. Besides, little is known about how the changes in an Android framework can induce such compatibility issues. To bridge such gaps, we conducted a systematic study by analyzing 196 real-world issues collected from 43 popular apps. We identified common patterns of Android framework code changes that induce such configuration compatibility issues. Based on the findings, we propose \textsc{ConfDroid} that can automatically extract rules for detecting configuration compatibility issues. The intuition is to perform symbolic execution based on a model learned from the common code change patterns. Experiment results show that ConfDroid can successfully extract 282 valid issue-detection rules with a precision of 91.9%. Among them, 65 extracted rules can manifest issues that cannot be detected by the rules of state-of-the-art baselines. More importantly, 11 out of them have led to the detection of 107 reproducible configuration compatibility issues that the baselines cannot detect in 30 out of 316 real-world Android apps.

CRAug 24, 2021
Characterizing Transaction-Reverting Statements in Ethereum Smart Contracts

Lu Liu, Lili Wei, Wuqi Zhang et al.

Smart contracts are programs running on blockchain to execute transactions. When input constraints or security properties are violated at runtime, the transaction being executed by a smart contract needs to be reverted to avoid undesirable consequences. On Ethereum, the most popular blockchain that supports smart contracts, developers can choose among three transaction-reverting statements (i.e., require, if...revert, and if...throw) to handle anomalous transactions. While these transaction-reverting statements are vital for preventing smart contracts from exhibiting abnormal behaviors or suffering malicious attacks, there is limited understanding of how they are used in practice. In this work, we perform the first empirical study to characterize transaction-reverting statements in Ethereum smart contracts. We measured the prevalence of these statements in 3,866 verified smart contracts from popular dapps and built a taxonomy of their purposes via manually analyzing 557 transaction-reverting statements. We also compared template contracts and their corresponding custom contracts to understand how developers customize the use of transaction-reverting statements. Finally, we analyzed the security impact of transaction-reverting statements by removing them from smart contracts and comparing the mutated contracts against the original ones. Our study led to important findings, which can shed light on further research in the broad area of smart contract quality assurance and provide practical guidance to smart contract developers on the appropriate use of transaction-reverting statements.

SEJun 17, 2021
ÐArcher: Detecting On-Chain-Off-Chain Synchronization Bugs in Decentralized Applications

Wuqi Zhang, Lili Wei, Shuqing Li et al.

Since the emergence of Ethereum, blockchain-based decentralized applications (DApps) have become increasingly popular and important. To balance the security, performance, and costs, a DApp typically consists of two layers: an on-chain layer to execute transactions and store crucial data on the blockchain and an off-chain layer to interact with users. A DApp needs to synchronize its off-chain layer with the on-chain layer proactively. Otherwise, the inconsistent data in the off-chain layer could mislead users and cause undesirable consequences, e.g., loss of transaction fees. However, transactions sent to the blockchain are not guaranteed to be executed and could even be reversed after execution due to chain reorganization. Such non-determinism in the transaction execution is unique to blockchain. DApp developers may fail to perform the on-chain-off-chain synchronization accurately due to their lack of familiarity with the complex transaction lifecycle. In this work, we investigate the challenges of synchronizing on-chain and off-chain data in Ethereum-based DApps. We present two types of bugs that could result in inconsistencies between the on-chain and off-chain layers. To help detect such on-chain-off-chain synchronization bugs, we introduce a state transition model to guide the testing of DApps and propose two effective oracles to facilitate the automatic identification of bugs. We build the first testing framework, DArcher, to detect on-chain-off-chain synchronization bugs in DApps. We have evaluated DArcher on 11 popular real-world DApps. DArcher achieves high precision (99.3%), recall (87.6%), and accuracy (89.4%) in bug detection and significantly outperforms the baseline methods. It has found 15 real bugs in the 11 DApps. So far, six of the 15 bugs have been confirmed by the developers, and three have been fixed. These promising results demonstrate the usefulness of DArcher.