Fan Long

CR
h-index36
8papers
209citations
Novelty58%
AI Score55

8 Papers

23.5CRApr 20
Enforcing Control Flow Integrity on DeFi Smart Contracts

Zhiyang Chen, Sidi Mohamed Beillahi, Pasha Barahimi et al.

Smart contracts power decentralized financial (DeFi) services but are vulnerable to security exploits that can lead to significant financial losses. Existing security measures often fail to adequately protect these contracts due to the composability of DeFi protocols and the increasing sophistication of attacks. Through a large-scale empirical study of historical transactions from the 37 hacked DeFi protocols, we discovered that while benign transactions typically exhibit a limited number of unique control flows, in stark contrast, attack transactions consistently introduce novel, previously unobserved control flows. Building on these insights, we developed CrossGuard, a novel framework that enforces control flow integrity onchain to secure smart contracts. Crucially, CrossGuard does not require prior knowledge of specific hacks. Instead, configured only once at deployment, it enforces control flow whitelisting policies and applies simplification heuristics at runtime. This approach monitors and prevents potential attacks by reverting all transactions that do not adhere to the established control flow whitelisting rules. Our evaluation demonstrates that CrossGuard effectively blocks 35 of the 37 analyzed attacks when configured only once at contract deployment, maintaining a low false positive rate of 0.26% and minimal additional gas costs. These results underscore the efficacy of applying control flow integrity to smart contracts, significantly enhancing security beyond traditional methods and addressing the evolving threat landscape in the DeFi ecosystem.

SEJul 28, 2025Code
TypyBench: Evaluating LLM Type Inference for Untyped Python Repositories

Honghua Dong, Jiacheng Yang, Xun Deng et al.

Type inference for dynamic languages like Python is a persistent challenge in software engineering. While large language models (LLMs) have shown promise in code understanding, their type inference capabilities remain underexplored. We introduce TypyBench, a benchmark designed to evaluate LLMs' type inference across entire Python repositories. TypyBench features two novel metrics: TypeSim, which captures nuanced semantic relationships between predicted and ground truth types, and TypeCheck, which assesses type consistency across codebases. Our evaluation of various LLMs on a curated dataset of 50 high-quality Python repositories reveals that, although LLMs achieve decent TypeSim scores, they struggle with complex nested types and exhibit significant type consistency errors. These findings suggest that future research should shift focus from improving type similarity to addressing repository-level consistency. TypyBench provides a foundation for this new direction, offering insights into model performance across different type complexities and usage contexts. Our code and data are available at https://github.com/typybench/typybench.

77.7LGApr 1
CuTeGen: An LLM-Based Agentic Framework for Generation and Optimization of High-Performance GPU Kernels using CuTe

Tara Saba, Anne Ouyang, Xujie Si et al.

High-performance GPU kernels are critical to modern machine learning systems, yet developing efficient implementations remains a challenging, expert-driven process due to the tight coupling between algorithmic structure, memory hierarchy usage, and hardware-specific optimizations. Recent work has explored using large language models (LLMs) to generate GPU kernels automatically, but generated implementations often struggle to maintain correctness and achieve competitive performance across iterative refinements. We present CuTeGen, an agentic framework for automated generation and optimization of GPU kernels that treats kernel development as a structured generate--test--refine workflow. Unlike approaches that rely on one-shot generation or large-scale search over candidate implementations, CuTeGen focuses on progressive refinement of a single evolving kernel through execution-based validation, structured debugging, and staged optimization. A key design choice is to generate kernels using the CuTe abstraction layer, which exposes performance-critical structures such as tiling and data movement while providing a more stable representation for iterative modification. To guide performance improvement, CuTeGen incorporates workload-aware optimization prompts and delayed integration of profiling feedback. Experimental results on matrix multiplication and activation workloads demonstrate that the framework produces functionally correct kernels and achieves competitive performance relative to optimized library implementations.

CRSep 2, 2025
Scam2Prompt: A Scalable Framework for Auditing Malicious Scam Endpoints in Production LLMs

Zhiyang Chen, Tara Saba, Xun Deng et al.

Large Language Models (LLMs) have become critical to modern software development, but their reliance on uncurated web-scale datasets for training introduces a significant security risk: the absorption and reproduction of malicious content. To systematically evaluate this risk, we introduce Scam2Prompt, a scalable automated auditing framework that identifies the underlying intent of a scam site and then synthesizes innocuous, developer-style prompts that mirror this intent, allowing us to test whether an LLM will generate malicious code in response to these innocuous prompts. In a large-scale study of four production LLMs (GPT-4o, GPT-4o-mini, Llama-4-Scout, and DeepSeek-V3), we found that Scam2Prompt's innocuous prompts triggered malicious URL generation in 4.24% of cases. To test the persistence of this security risk, we constructed Innoc2Scam-bench, a benchmark of 1,559 innocuous prompts that consistently elicited malicious code from all four initial LLMs. When applied to seven additional production LLMs released in 2025, we found the vulnerability is not only present but severe, with malicious code generation rates ranging from 12.7% to 43.8%. Furthermore, existing safety measures like state-of-the-art guardrails proved insufficient to prevent this behavior, with an overall detection rate of less than 0.3%.

PLFeb 22, 2021
SigVM: Enabling Event-Driven Execution for Autonomous Smart Contracts

Zihan Zhao, Sidi Mohamed Beillahi, Ryan Song et al.

This paper presents SigVM, a novel blockchain virtual machine that supports an event-driven execution model, enabling developers to build autonomous smart contracts. Contracts in SigVM can emit signal events, on which other contracts can listen. Once an event is triggered, corresponding handler functions are automatically executed as signal transactions. We build an end-to-end blockchain platform SigChain and a contract language compiler SigSolid to realize the potential of SigVM. Experimental results show that our benchmark applications can be reimplemented with SigVM in an autonomous way, eliminating the dependency on unreliable mechanisms like off-chain relay servers. The development effort of reimplementing these contracts with SigVM is small, i.e., we modified on average 2.6% of the contract code.

CRJun 1, 2020
GHAST: Breaking Confirmation Delay Barrier in Nakamoto Consensus via Adaptive Weighted Blocks

Chenxing Li, Fan Long, Guang Yang

Initiated from Nakamoto's Bitcoin system, blockchain technology has demonstrated great capability of building secure consensus among decentralized parties at Internet-scale, i.e., without relying on any centralized trusted party. Nowadays, blockchain systems find applications in various fields. But the performance is increasingly becoming a bottleneck, especially when permissionless participation is retained for full decentralization. In this work, we present a new consensus protocol named GHAST (Greedy Heaviest Adaptive Sub-Tree) which organizes blocks in a Tree-Graph structure (i.e., a directed acyclic graph (DAG) with a tree embedded) that allows fast and concurrent block generation. GHAST protocol simultaneously achieves a logarithmically bounded liveness guarantee and low confirmation latency. More specifically, for maximum latency $d$ and adversarial computing power bounded away from 50\%, GHAST guarantees confirmation with confidence $\ge 1-\varepsilon$ after a time period of $O(d\cdot \log(1/\varepsilon))$. When there is no observable attack, GHAST only needs $3d$ time to achieve confirmation at the same confidence level as six-block-confirmation in Bitcoin, while it takes roughly $360d$ in Bitcoin.

CRDec 18, 2018
Detecting Standard Violation Errors in Smart Contracts

Ao Li, Fan Long

We present SOLAR, a new analysis tool for automatically detecting standard violation errors in Ethereum smart contracts.Given the Ethereum Virtual Machine (EVM) bytecode of a smart contract and a user specified constraint or invariant derived from a technical standard such as ERC-20,SOLAR symbolically executes the contract, explores all possible execution paths, and checks whether it is possible to initiate a sequence of malicious transactions to violate the specified constraint or invariant. Our experimental results highlight the effectiveness of SOLAR in finding new errors in smart con-tracts. Out of the evaluated 779 ERC-20 and 310 ERC-721smart contracts, SOLAR found 255 standard violation errors in 197 vulnerable contracts with only three false positives.237 out of the 255 errors are zero-day errors that are not re-ported before. Our results sound the alarm on the prevalence of standard violation errors in critical smart contracts that manipulate publicly traded digital assets

SEFeb 18, 2016
An Analysis of the Search Spaces for Generate and Validate Patch Generation Systems

Fan Long, Martin Rinard

We present the first systematic analysis of the characteristics of patch search spaces for automatic patch generation systems. We analyze the search spaces of two current state-of-the-art systems, SPR and Prophet, with 16 different search space configurations. Our results are derived from an analysis of 1104 different search spaces and 768 patch generation executions. Together these experiments consumed over 9000 hours of CPU time on Amazon EC2. The analysis shows that 1) correct patches are sparse in the search spaces (typically at most one correct patch per search space per defect), 2) incorrect patches that nevertheless pass all of the test cases in the validation test suite are typically orders of magnitude more abundant, and 3) leveraging information other than the test suite is therefore critical for enabling the system to successfully isolate correct patches. We also characterize a key tradeoff in the structure of the search spaces. Larger and richer search spaces that contain correct patches for more defects can actually cause systems to find fewer, not more, correct patches. We identify two reasons for this phenomenon: 1) increased validation times because of the presence of more candidate patches and 2) more incorrect patches that pass the test suite and block the discovery of correct patches. These fundamental properties, which are all characterized for the first time in this paper, help explain why past systems often fail to generate correct patches and help identify challenges, opportunities, and productive future directions for the field.