Zhengzi Xu

AI
h-index22
5papers
955citations
Novelty52%
AI Score43

5 Papers

CRAug 7, 2023
GPTScan: Detecting Logic Vulnerabilities in Smart Contracts by Combining GPT with Program Analysis

Yuqiang Sun, Daoyuan Wu, Yue Xue et al.

Smart contracts are prone to various vulnerabilities, leading to substantial financial losses over time. Current analysis tools mainly target vulnerabilities with fixed control or data-flow patterns, such as re-entrancy and integer overflow. However, a recent study on Web3 security bugs revealed that about 80% of these bugs cannot be audited by existing tools due to the lack of domain-specific property description and checking. Given recent advances in Large Language Models (LLMs), it is worth exploring how Generative Pre-training Transformer (GPT) could aid in detecting logicc vulnerabilities. In this paper, we propose GPTScan, the first tool combining GPT with static analysis for smart contract logic vulnerability detection. Instead of relying solely on GPT to identify vulnerabilities, which can lead to high false positives and is limited by GPT's pre-trained knowledge, we utilize GPT as a versatile code understanding tool. By breaking down each logic vulnerability type into scenarios and properties, GPTScan matches candidate vulnerabilities with GPT. To enhance accuracy, GPTScan further instructs GPT to intelligently recognize key variables and statements, which are then validated by static confirmation. Evaluation on diverse datasets with around 400 contract projects and 3K Solidity files shows that GPTScan achieves high precision (over 90%) for token contracts and acceptable precision (57.14%) for large projects like Web3Bugs. It effectively detects ground-truth logic vulnerabilities with a recall of over 70%, including 9 new vulnerabilities missed by human auditors. GPTScan is fast and cost-effective, taking an average of 14.39 seconds and 0.01 USD to scan per thousand lines of Solidity code. Moreover, static confirmation helps GPTScan reduce two-thirds of false positives.

SEMar 29
Understanding NPM Malicious Package Detection: A Benchmark-Driven Empirical Analysis

Wenbo Guo, Zhongwen Chen, Zhengzi Xu et al.

The NPM ecosystem has become a primary target for software supply chain attacks, yet existing detection tools are evaluated in isolation on incompatible datasets, making cross-tool comparison unreliable. We conduct a benchmark-driven empirical analysis of NPM malware detection, building a dataset of 6,420 malicious and 7,288 benign packages annotated with 11 behavior categories and 8 evasion techniques, and evaluating 8 tools across 13 variants. Unlike prior work, we complement quantitative evaluation with source-code inspection of each tool to expose the structural mechanisms behind its performance. Our analysis reveals five key findings. Tool precision-recall positions are structurally determined by how each tool resolves the ambiguity between what code can do and what it intends to do, with GuardDog achieving the best balance at 93.32% F1. A single API call carries no directional intent, but a behavioral chain such as collecting environment variables, serializing, and exfiltrating disambiguates malicious purpose, raising SAP_DT detection from 3.2% to 79.3%. Most malware requires no evasion because the ecosystem lacks mandatory pre-publication scanning. ML degradation stems from concept convergence rather than concept drift: malware became simpler and statistically indistinguishable from benign code in feature space. Tool combination effectiveness is governed by complementarity minus false-positive introduction, not paradigm diversity, with strategic combinations reaching 96.08% accuracy and 95.79% F1. Our benchmark and evaluation framework are publicly available.

AIApr 26, 2025
A Vision for Auto Research with LLM Agents

Chengwei Liu, Chong Wang, Jiayue Cao et al.

This paper introduces Agent-Based Auto Research, a structured multi-agent framework designed to automate, coordinate, and optimize the full lifecycle of scientific research. Leveraging the capabilities of large language models (LLMs) and modular agent collaboration, the system spans all major research phases, including literature review, ideation, methodology planning, experimentation, paper writing, peer review response, and dissemination. By addressing issues such as fragmented workflows, uneven methodological expertise, and cognitive overload, the framework offers a systematic and scalable approach to scientific inquiry. Preliminary explorations demonstrate the feasibility and potential of Auto Research as a promising paradigm for self-improving, AI-driven research processes.

SEMay 23, 2023
Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study

Yi Liu, Gelei Deng, Zhengzi Xu et al.

Large Language Models (LLMs), like ChatGPT, have demonstrated vast potential but also introduce challenges related to content constraints and potential misuse. Our study investigates three key research questions: (1) the number of different prompt types that can jailbreak LLMs, (2) the effectiveness of jailbreak prompts in circumventing LLM constraints, and (3) the resilience of ChatGPT against these jailbreak prompts. Initially, we develop a classification model to analyze the distribution of existing prompts, identifying ten distinct patterns and three categories of jailbreak prompts. Subsequently, we assess the jailbreak capability of prompts with ChatGPT versions 3.5 and 4.0, utilizing a dataset of 3,120 jailbreak questions across eight prohibited scenarios. Finally, we evaluate the resistance of ChatGPT against jailbreak prompts, finding that the prompts can consistently evade the restrictions in 40 use-case scenarios. The study underscores the importance of prompt structures in jailbreaking LLMs and discusses the challenges of robust jailbreak prompt generation and prevention.

AIJul 1, 2021
Cross-Lingual Transfer Learning for Statistical Type Inference

Zhiming Li, Xiaofei Xie, Haoliang Li et al.

Hitherto statistical type inference systems rely thoroughly on supervised learning approaches, which require laborious manual effort to collect and label large amounts of data. Most Turing-complete imperative languages share similar control- and data-flow structures, which make it possible to transfer knowledge learned from one language to another. In this paper, we propose a cross-lingual transfer learning framework, PLATO, for statistical type inference, which allows us to leverage prior knowledge learned from the labeled dataset of one language and transfer it to the others, e.g., Python to JavaScript, Java to JavaScript, etc. PLATO is powered by a novel kernelized attention mechanism to constrain the attention scope of the backbone Transformer model such that the model is forced to base its prediction on commonly shared features among languages. In addition, we propose the syntax enhancement that augments the learning on the feature overlap among language domains. Furthermore, PLATO can also be used to improve the performance of the conventional supervised learning-based type inference by introducing cross-lingual augmentation, which enables the model to learn more general features across multiple languages. We evaluated PLATO under two settings: 1) under the cross-domain scenario that the target language data is not labeled or labeled partially, the results show that PLATO outperforms the state-of-the-art domain transfer techniques by a large margin, e.g., it improves the Python to TypeScript baseline by +5.40%@EM, +5.40%@weighted-F1, and 2) under the conventional monolingual supervised learning based scenario, PLATO improves the Python baseline by +4.40%@EM, +3.20%@EM (parametric).