Haipeng Cai

SE
h-index 12
13 papers
286 citations
Novelty 48%
AI Score 42

13 Papers

SE Aug 5, 2024
From LLMs to LLM-based Agents for Software Engineering: A Survey of Current, Challenges and Future

Haolin Jin, Linghan Huang, Haipeng Cai et al.

With the rise of large language models (LLMs), researchers are increasingly exploring their applications in various vertical domains, such as software engineering. LLMs have achieved remarkable success in areas including code generation and vulnerability detection. However, they also exhibit numerous limitations and shortcomings. LLM-based agents, a novel technology with the potential for Artificial General Intelligence (AGI), combine LLMs as the core for decision-making and action-taking, addressing some of the inherent limitations of LLMs such as lack of autonomy and self-improvement. Despite numerous studies and surveys exploring the possibility of using LLMs in software engineering, the literature lacks a clear distinction between LLMs and LLM-based agents. A unified standard and benchmark for qualifying an LLM solution as an LLM-based agent in its domain is still at an early stage. In this survey, we broadly investigate the current practice and solutions for LLMs and LLM-based agents for software engineering. In particular, we summarise six key topics: requirement engineering, code generation, autonomous decision-making, software design, test generation, and software maintenance. We review and differentiate the work of LLMs and LLM-based agents across these six topics, examining their differences and similarities in tasks, benchmarks, and evaluation metrics. Finally, we discuss the models and benchmarks used, providing a comprehensive analysis of their applications and effectiveness in software engineering. We anticipate this work will shed some light on pushing the boundaries of LLM-based agents in software engineering for future research.

SE Aug 7, 2024
VulScribeR: Exploring RAG-based Vulnerability Augmentation with LLMs

Seyed Shayan Daneshvar, Yu Nong, Xu Yang et al.

Detecting vulnerabilities is vital for software security, yet deep learning-based vulnerability detectors (DLVD) face a data shortage, which limits their effectiveness. Data augmentation can potentially alleviate the data shortage, but augmenting vulnerable code is challenging and requires a generative solution that maintains vulnerability. Previous works have only focused on generating samples that contain single statements or specific types of vulnerabilities. Recently, large language models (LLMs) have been used to solve various code generation and comprehension tasks with inspiring results, especially when fused with retrieval augmented generation (RAG). Therefore, we propose VulScribeR, a novel LLM-based solution that leverages carefully curated prompt templates to augment vulnerable datasets. More specifically, we explore three strategies to augment both single and multi-statement vulnerabilities with LLMs, namely Mutation, Injection, and Extension. Our extensive evaluation across four vulnerability datasets and DLVD models, using three LLMs, shows that our approach beats two SOTA methods Vulgen and VGX, and Random Oversampling (ROS) by 27.48%, 27.93%, and 15.41% in F1-score with 5K generated vulnerable samples on average, and 53.84%, 54.10%, 69.90%, and 40.93% with 15K generated vulnerable samples. Our approach demonstrates its feasibility for large-scale data augmentation by generating 1K samples for as little as US$1.88.

SE Nov 24, 2025 Code
DUALGAUGE: Automated Joint Security-Functionality Benchmarking for Secure Code Generation

Abhijeet Pathak, Suvadra Barua, Dinesh Gudimetla et al.

Large language models (LLMs) and autonomous coding agents are increasingly used to generate software across a wide range of domains. Yet a core requirement remains unmet: ensuring that generated code is secure without compromising its functional correctness. Existing benchmarks and evaluations for secure code generation fall short: many measure only vulnerability reduction, disregard correctness preservation, or evaluate security and functionality on separate datasets, violating the fundamental need for simultaneous joint evaluation. We present DUALGAUGE, the first fully automated benchmarking framework designed to rigorously evaluate the security and correctness of LLM-generated code in unison. Given the lack of datasets enabling joint evaluation of secure code generation, we also present DUALGAUGE-BENCH, a curated benchmark suite of diverse coding tasks, each paired with manually validated test suites for both security and functionality, designed for full coverage of specification requirements. At the core of DUALGAUGE is an agentic program executor, which runs a program against given tests in sandboxed environments, and an LLM-based evaluator, which assesses both correctness and vulnerability behavior against expected outcomes. We rigorously evaluated and ensured the quality of DUALGAUGE-BENCH and the accuracy of DUALGAUGE, and applied DUALGAUGE to benchmarking ten leading LLMs on DUALGAUGE-BENCH across thousands of test scenarios. Our results reveal critical gaps in correct and secure code generation by these LLMs, for which our open-source system and datasets help accelerate progress via reproducible, scalable, and rigorous evaluation.

CR May 10, 2025
System Prompt Poisoning: Persistent Attacks on Large Language Models Beyond User Injection

Zongze Li, Jiawei Guo, Haipeng Cai

Large language models (LLMs) have gained widespread adoption across diverse applications due to their impressive generative capabilities. Their plug-and-play nature enables both developers and end users to interact with these models through simple prompts. However, as LLMs become more integrated into various systems in diverse domains, concerns around their security are growing. Existing studies mainly focus on threats arising from user prompts (e.g., prompt injection attacks) and model output (e.g., model inversion attacks), while the security of system prompts remains largely overlooked. This work bridges this critical gap. We introduce system prompt poisoning, a new attack vector against LLMs that, unlike traditional user prompt injection, poisons system prompts and hence persistently impacts all subsequent user interactions and model responses. We systematically investigate four practical attack strategies in various poisoning scenarios. Through demonstration on both generative and reasoning LLMs, we show that system prompt poisoning is highly feasible without requiring jailbreak techniques, and effective across a wide range of tasks, including those in mathematics, coding, logical reasoning, and natural language processing. Importantly, our findings reveal that the attack remains effective even when user prompts employ advanced prompting techniques like chain-of-thought (CoT). We also show that such techniques, including CoT and retrieval-augmented generation (RAG), which are proven to be effective for improving LLM performance in a wide range of tasks, are significantly weakened in their effectiveness by system prompt poisoning.

CR Oct 8, 2025
Fortifying LLM-Based Code Generation with Graph-Based Reasoning on Secure Coding Practices

Rupam Patir, Keyan Guo, Haipeng Cai et al.

The code generation capabilities of Large Language Models (LLMs) have transformed the field of software development. However, this advancement also presents significant security challenges, as LLM-generated code often contains vulnerabilities. One direction of research strengthens LLMs by injecting or refining security knowledge through curated datasets, model tuning, or static analyzers. While effective in certain settings, these methods can be resource-intensive, less adaptable to zero-day vulnerabilities, and often inapplicable to proprietary models. To address these challenges, we introduce GRASP, which explores a new direction that focuses on structured reasoning over Secure Coding Practices (SCPs) rather than additional training or external feedback. GRASP comprises two key ideas: (1) an SCP graph that organizes SCPs into a Directed Acyclic Graph (DAG) capturing dependencies and relationships, and (2) a graph-based reasoning process that systematically guides LLMs through relevant SCPs for code generation. This design enables interpretable, model-agnostic, and scalable security improvements, particularly for previously unseen vulnerabilities. Our evaluation shows that GRASP consistently achieves Security Rates (SR) exceeding 80% across multiple LLMs, and delivers up to 88% improvements over baselines on zero-day vulnerabilities.
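
The two ideas named in this abstract (a DAG of SCPs and a guided walk through the relevant ones) can be pictured with a small sketch. This is not GRASP's implementation: the SCP names, the `SCP_GRAPH` dictionary, and the `ordered_scps` helper are hypothetical, and a plain topological ordering stands in for whatever reasoning process the paper actually uses.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical SCP graph: each SCP maps to the SCPs it depends on.
SCP_GRAPH = {
    "validate_input": set(),
    "parameterize_queries": {"validate_input"},
    "encode_output": {"validate_input"},
    "log_without_secrets": {"parameterize_queries", "encode_output"},
}

def ordered_scps(graph, relevant):
    """Return the relevant SCPs plus their transitive prerequisites,
    ordered so that every prerequisite comes before its dependents."""
    needed = set()
    stack = list(relevant)
    while stack:
        scp = stack.pop()
        if scp not in needed:
            needed.add(scp)
            stack.extend(graph[scp])
    # Restrict the graph to the needed SCPs, then sort it topologically.
    subgraph = {scp: graph[scp] & needed for scp in needed}
    return list(TopologicalSorter(subgraph).static_order())

# Each SCP in the resulting order could become one step of a prompt.
for step, scp in enumerate(ordered_scps(SCP_GRAPH, {"parameterize_queries"}), 1):
    print(f"Step {step}: apply SCP '{scp}'")
```

Because the graph is acyclic, a dependency-respecting order always exists; `TopologicalSorter` raises `CycleError` if an edit to the graph introduces a cycle.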

SE Nov 8, 2021
D$^2$ABS: A Framework for Dynamic Dependence Abstraction of Distributed Programs

Haipeng Cai, Xiaoqin Fu

As modern software systems are increasingly developed for running in distributed environments, it is crucial to provide fundamental techniques such as dependence analysis for checking, diagnosing, and evolving those systems. However, traditional dependence analysis is either inapplicable or of very limited utility for distributed programs due to the decoupled components of these programs that run in concurrent processes on physically separated machines. Motivated by the need for dependence analysis of distributed software and the diverse cost-effectiveness needs of dependence-based applications, this paper presents D$^2$ABS, a framework of dynamic dependence abstraction for distributed programs. By partially ordering distributed method-execution events and inferring causality from the ordered events, D$^2$ABS abstracts method-level dependencies both within and across process boundaries. Further, by exploiting message-passing semantics across processes, and incorporating static dependencies and statement coverage within individual components, we present three additional instantiations of D$^2$ABS that trade efficiency for better precision. We present the design of the D$^2$ABS framework and evaluate the four instantiations of D$^2$ABS on distributed systems of various architectures and scales using our implementation for Java. Our empirical results show that D$^2$ABS is significantly more effective than existing options while offering varied levels of cost-effectiveness tradeoffs. As our framework essentially computes whole-system run-time dependencies, it naturally empowers a range of other dependence-based applications.

SE Mar 15, 2021
EnHMM: On the Use of Ensemble HMMs and Stack Traces to Predict the Reassignment of Bug Report Fields

Md Shariful Islam, Abdelwahab Hamou-Lhadj, Korosh K. Sabor et al.

Bug reports (BR) contain vital information that can help triaging teams prioritize and assign bugs to developers who will provide the fixes. However, studies have shown that BR fields often contain incorrect information that needs to be reassigned, which delays the bug fixing process. There exist approaches for predicting whether a BR field should be reassigned or not. These studies use mainly BR descriptions and traditional machine learning algorithms (SVM, KNN, etc.). As such, they do not fully benefit from the sequential order of information in BR data, such as function call sequences in BR stack traces, which may be valuable for improving the prediction accuracy. In this paper, we propose a novel approach, called EnHMM, for predicting the reassignment of BR fields using ensemble Hidden Markov Models (HMMs), trained on stack traces. EnHMM leverages the natural ability of HMMs to represent sequential data to model the temporal order of function calls in BR stack traces. When applied to Eclipse and Gnome BR repositories, EnHMM achieves an average precision, recall, and F-measure of 54%, 76%, and 60% on the Eclipse dataset and 41%, 69%, and 51% on the Gnome dataset. We also found that EnHMM improves over the best single HMM by 36% for Eclipse and 76% for Gnome. Finally, when comparing EnHMM to Im.ML.KNN, a recent approach in the field, we found that the average F-measure score of EnHMM improves the average F-measure of Im.ML.KNN by 6.80% and improves the average recall of Im.ML.KNN by 36.09%. However, the average precision of EnHMM is lower than that of Im.ML.KNN (53.93% as opposed to 56.71%).

SE Feb 25, 2021
A Lightweight Approach of Human-Like Playtesting

Yan Zhao, Weihao Zhang, Enyi Tang et al.

A playtest is the process in which human testers are recruited to play video games and to reveal software bugs. Manual testing is expensive and time-consuming, especially when there are many mobile games to test and every software version requires extensive testing before being released. Existing testing frameworks (e.g., Android Monkey) are limited because they adopt no domain knowledge to play games. Learning-based tools (e.g., Wuji) involve a huge amount of training data and computation before testing any game. This paper presents LIT, our lightweight approach to generalize playtesting tactics from manual testing, and to adopt the generalized tactics to automate game testing. LIT consists of two phases. In Phase I, while a human plays an Android game app G for a short period of time (e.g., eight minutes), LIT records the user's actions (e.g., swipe) and the scene before each action. Based on the collected data, LIT generalizes a set of context-aware, abstract playtesting tactics, which describe under what circumstances what actions can be taken to play the game. In Phase II, LIT tests G based on the generalized tactics. Namely, given a randomly generated game scene, LIT searches for a match with the abstract context of any inferred tactic; if there is a match, LIT customizes the tactic and generates a feasible event to play the game. Our evaluation with nine games shows LIT to outperform two state-of-the-art tools. This implies that by automating playtesting, LIT will significantly reduce manual testing and boost the quality of game apps.

CR Jul 22, 2018
A Preliminary Study On the Sustainability of Android Malware Detection

Haipeng Cai

Machine learning-based malware detection dominates current security defense approaches for Android apps. However, due to the evolution of Android platforms and malware, existing techniques of this kind are widely limited by their need for constant retraining, which is costly, and by their reliance on new malware samples that may not be available in a timely manner. As a result, new and emerging malware slips through, as seen from the continued surge of malware in the wild. Thus, a more practical detector needs not only to be accurate but, more critically, to be able to sustain its capabilities over time without frequent retraining. In this paper, we study how Android apps evolve as a population over time, in terms of their behaviors related to accesses to sensitive information and operations. We first perform a longitudinal characterization of 6K benign and malicious apps developed across seven years, with a focus on these sensitive accesses in app executions. Our study reveals, during the long evolution, a consistent, clear differentiation between malware and benign apps regarding such accesses, measured by relative statistics of relevant method calls. Following these findings, we developed DroidSpan, a novel classification system based on a new behavioral profile for Android apps. Through an extensive evaluation, we showed that DroidSpan can not only effectively detect malware but also sustain high detection accuracy (93% F1 measure) for four years (with 81% F1 for five years). Through a dedicated study, we also showed its resiliency to sophisticated evasion schemes. By comparing to a state-of-the-art malware detector, we demonstrated the largely superior sustainability of our approach at reasonable costs.

SE Apr 15, 2016
DISTEA: Efficient Dynamic Impact Analysis for Distributed Systems

Haipeng Cai, Douglas Thain

Dynamic impact analysis is a fundamental technique for understanding the impact of specific program entities, or changes to them, on the rest of the program for concrete executions. However, existing techniques are either inapplicable or of very limited utility for distributed programs running in multiple concurrent processes. This paper presents DISTEA, a technique and tool for dynamic impact analysis of distributed systems. By partially ordering distributed method-execution events and inferring causality from the ordered events, DISTEA can predict impacts propagated both within and across process boundaries. We implemented DISTEA for Java and applied it to four distributed programs of various types and sizes, including two enterprise systems. We also evaluated the precision and practical usefulness of DISTEA, and demonstrated its application in program comprehension, through two case studies. The results show that DISTEA is highly scalable, more effective than existing alternatives, and instrumental to understanding distributed systems and their executions.
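
The idea of partially ordering method-execution events across processes and reading causality off that order can be sketched as follows. The abstract does not say which timestamping scheme DISTEA uses, so vector clocks are an assumption here, and `Process`, `happened_before`, and `impact_set` are hypothetical names for illustration only.

```python
from dataclasses import dataclass, field

@dataclass
class Process:
    pid: int                                   # index of this process
    n: int                                     # total number of processes
    clock: list = field(default_factory=list)  # vector clock
    log: list = field(default_factory=list)    # (method, timestamp) events

    def __post_init__(self):
        self.clock = [0] * self.n

    def enter(self, method):
        """Record a method-execution event with a fresh timestamp."""
        self.clock[self.pid] += 1
        self.log.append((method, tuple(self.clock)))

    def send(self):
        """Tick and return the clock to piggyback on an outgoing message."""
        self.clock[self.pid] += 1
        return tuple(self.clock)

    def receive(self, msg_clock):
        """Merge the sender's clock element-wise, then tick."""
        self.clock = [max(a, b) for a, b in zip(self.clock, msg_clock)]
        self.clock[self.pid] += 1

def happened_before(a, b):
    """True if timestamp a causally precedes timestamp b."""
    return a != b and all(x <= y for x, y in zip(a, b))

def impact_set(query, processes):
    """Methods with an event causally after some event of `query`."""
    events = [e for p in processes for e in p.log]
    starts = [ts for m, ts in events if m == query]
    return {m for m, ts in events
            if any(happened_before(s, ts) for s in starts)}

# Two processes: a client that sends one message to a server.
client, server = Process(0, 2), Process(1, 2)
client.enter("Client.request")
server.enter("Server.init")       # concurrent with the client's event
server.receive(client.send())
server.enter("Server.handle")
print(impact_set("Client.request", [client, server]))
```

In this run `Server.init` is concurrent with `Client.request`, so only `Server.handle` is reported as potentially impacted across the process boundary.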

SE Feb 23, 2015
Enhancing Programming Interface to Effectively Meet Multiple Information Needs of Developers

Haipeng Cai

In the past decades, integrated development environments (IDEs) have been largely advanced to facilitate common software engineering tasks. Yet, with growing information needs driven by increasing complexity in developing modern high-quality software, developers often need to switch among multiple user interfaces, even across different applications, in their development process, which breaks their mental workflow and thus tends to adversely affect their working efficiency and productivity. This position paper discusses challenges faced by current IDE designs, arising mainly from the working context transitions developers make while seeking the multiple kinds of information needed for their development tasks. It identifies the primary obstacles behind these challenges and offers an initial exploration of high-level design considerations for overcoming them in next-generation IDEs. Specifically, a few design enhancements on top of modern IDEs are envisioned, attempting to reduce the overheads of frequent context switching commonly seen in the multitasking of developers.

DC Oct 11, 2013
Depth-dependent Parallel Visualization with 3D Stylized Dense Tubes

Haipeng Cai, Jian Chen, Alexander P. Auchus

We present a parallel visualization algorithm for the illustrative rendering of depth-dependent stylized dense tube data at interactive frame rates. While this computation could be efficiently performed on a GPU device, we target a parallel framework that enables it to run efficiently on an ordinary multi-core CPU platform, which is much more widely available than GPUs to common users. Our approach is to map the depth information in each tube onto each of the visual dimensions of shape, color, texture, value, and size on the basis of Bertin's semiology theory. The purpose is to enable more legible displays in dense tube environments. A major contribution of our work is an efficient and effective parallel depth-ordering algorithm that makes use of the message passing interface (MPI) with VTK. We evaluated our framework with visualizations of depth-stylized tubes derived from 3D diffusion tensor MRI data by comparing its efficiency with several alternative parallelization platforms running the same computations. As our results show, the parallelization framework we proposed can efficiently render highly dense 3D data sets like the tube data and thus is useful as a complement to parallel visualization environments that rely on GPUs.

GR Oct 10, 2013
Composing DTI Visualizations with End-user Programming

Haipeng Cai, Jian Chen, Alexander P. Auchus et al.

We present the design and prototype implementation of a scientific visualization language called Zifazah for composing 3D visualizations of diffusion tensor magnetic resonance imaging (DT-MRI or DTI) data. Unlike existing programmer-oriented tools that allow flexible customization of data visualizations, we focus on domain scientists as end users in order to enable them to freely compose visualizations of their scientific data sets. To collect syntax and semantics for the language design, we analyzed end-user descriptions, extracted from interviews with neurologists and physicians who use DTI in clinical practice, of how they would build and use DTI visualizations, and from these identified the elements and structure of the proposed language. Zifazah makes use of the initial set of lexical terms and semantics to provide a declarative language in the spirit of intuitive syntax and usage. This work contributes three main design principles, among others, for scientific visualization language design, as well as a practical instance of such a language for DTI visualization in Zifazah. First, Zifazah incorporates visual symbolic mapping based on color, size, and shape, which is a subset of Bertin's taxonomy migrated to scientific visualizations. Second, Zifazah is defined as a spatial language in which lexical representations of spatial relationships for 3D object visualization and manipulation, which are characteristic of scientific data, can be programmed. Third, built on top of Bertin's semiology, flexible data encoding specifically for scientific visualizations is integrated into our language in order to allow end users to achieve optimal visual composition. Along with sample scripts representative of our language design features, we also present some new DTI visualizations created by end users using the novel visualization language.