Jiangshan Yu

CR
h-index: 29
16 papers
556 citations
Novelty: 49%
AI Score: 44

16 Papers

DB · Dec 22, 2022
TxAllo: Dynamic Transaction Allocation in Sharded Blockchain Systems

Yuanzhe Zhang, Shirui Pan, Jiangshan Yu

The scalability problem has been one of the most significant barriers limiting the adoption of blockchains. Blockchain sharding is a promising approach to this problem. However, the sharding mechanism introduces a significant number of cross-shard transactions, which are expensive to process. This paper focuses on the transaction allocation problem to reduce the number of cross-shard transactions for better scalability. In particular, we systematically formulate the transaction allocation problem and convert it to the community detection problem on a graph. A deterministic and fast allocation scheme TxAllo is proposed to dynamically infer the allocation of accounts and their associated transactions. It directly optimizes the system throughput, considering both the number of cross-shard transactions and the workload balance among shards. We evaluate the performance of TxAllo on an Ethereum dataset containing over 91 million transactions. Our evaluation results show that for a blockchain with 60 shards, TxAllo reduces the cross-shard transaction ratio from 98% (by using traditional hash-based allocation) to about 12%. In the meantime, the workload balance is well maintained. Compared with other methods, the execution time of TxAllo is almost negligible. For example, when updating the allocation every hour, the execution of TxAllo only takes 0.5 seconds on average, whereas other concurrent works, such as BrokerChain (INFOCOM'22) leveraging the classic METIS method, require 422 seconds.
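
The 98% baseline quoted above is consistent with a simple back-of-the-envelope model, sketched here under stated assumptions rather than taken from the paper: if every transaction touches exactly two accounts and a hash places each account uniformly at random in one of k shards, the two accounts share a shard with probability 1/k, so the expected cross-shard ratio is 1 - 1/k (about 98.3% for k = 60). The function names and the use of SHA-256 as the allocation hash are illustrative choices; this is the hash-based baseline only, not TxAllo.

```python
import hashlib


def shard_of(account: str, num_shards: int) -> int:
    """Map an account to a shard by hashing its identifier.

    SHA-256 is a stand-in here (an assumption); any uniform hash would do.
    """
    digest = hashlib.sha256(account.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def cross_shard_ratio(transactions, num_shards: int) -> float:
    """Fraction of (sender, receiver) pairs whose accounts sit in different shards."""
    if not transactions:
        return 0.0
    cross = sum(
        1
        for sender, receiver in transactions
        if shard_of(sender, num_shards) != shard_of(receiver, num_shards)
    )
    return cross / len(transactions)


def expected_cross_shard_ratio(num_shards: int) -> float:
    """Expected ratio for two-account transactions under uniform random placement."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    return 1.0 - 1.0 / num_shards


if __name__ == "__main__":
    # Synthetic two-account transactions (hypothetical account names).
    txs = [(f"acct{i}", f"acct{i + 1}") for i in range(20000)]
    print("measured:", cross_shard_ratio(txs, 60))
    print("expected:", expected_cross_shard_ratio(60))
```

On synthetic data the measured ratio should land close to the expected value; real transaction graphs are far from uniform, which is the structure an allocation scheme can exploit.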

CR · Dec 2, 2025
Leveraging Large Language Models to Bridge On-chain and Off-chain Transparency in Stablecoins

Yuexin Xiang, Yuchen Lei, SM Mahir Shazeed Rish et al.

Stablecoins such as USDT and USDC aspire to peg stability by coupling issuance controls with reserve attestations. In practice, however, transparency is split across two unconnected worlds: verifiable on-chain traces and off-chain disclosures locked in unstructured text. We introduce a large language model (LLM)-based automated framework that bridges these two dimensions by aligning on-chain issuance data with off-chain disclosure statements. First, we propose an integrative framework using LLMs to capture and analyze on- and off-chain data through document parsing and semantic alignment, extracting key financial indicators from issuer attestations and mapping them to corresponding on-chain metrics. Second, we integrate multi-chain issuance records and disclosure documents within a model context protocol (MCP) framework that standardizes LLM access to both quantitative market data and qualitative disclosure narratives. This framework enables unified retrieval and contextual alignment across heterogeneous stablecoin information sources and facilitates consistent analysis. Third, we demonstrate the capability of LLMs to operate across heterogeneous data modalities in blockchain analytics, quantifying discrepancies between reported and observed circulation and examining their implications for cross-chain transparency and price dynamics. Our findings reveal systematic gaps between disclosed and verifiable data, showing that LLM-assisted analysis enhances cross-modal transparency and supports automated, data-driven auditing in decentralized finance (DeFi).

ML · Sep 18, 2023
New Bounds on the Accuracy of Majority Voting for Multi-Class Classification

Sina Aeeneh, Nikola Zlatanov, Jiangshan Yu

Majority voting is a simple mathematical function that returns the value that appears most often in a set. As a popular decision fusion technique, the majority voting function (MVF) finds applications in resolving conflicts, where a number of independent voters report their opinions on a classification problem. Despite its importance and its various applications in ensemble learning, data crowd-sourcing, remote sensing, and data oracles for blockchains, the accuracy of the MVF for the general multi-class classification problem has remained unknown. In this paper, we derive a new upper bound on the accuracy of the MVF for the multi-class classification problem. More specifically, we show that under certain conditions, the error rate of the MVF exponentially decays toward zero as the number of independent voters increases. Conversely, the error rate of the MVF exponentially grows with the number of independent voters if these conditions are not met. We first explore the problem for independent and identically distributed voters where we assume that every voter follows the same conditional probability distribution of voting for different classes, given the true classification of the data point. Next, we extend our results for the case where the voters are independent but non-identically distributed. Using the derived results, we then provide a discussion on the accuracy of the truth discovery algorithms. We show that in the best-case scenarios, truth discovery algorithms operate as an amplified MVF and thereby achieve a small error rate only when the MVF achieves a small error rate, and vice versa, achieve a large error rate when the MVF also achieves a large error rate. In the worst-case scenario, the truth discovery algorithms may achieve a higher error rate than the MVF. Finally, we confirm our theoretical results using numerical simulations.
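
A minimal sketch of the majority voting function and a small Monte Carlo check of how its error rate changes with the number of voters. Everything here is illustrative: the tie-breaking rule (smallest label wins), the function names, and the example vote distribution are assumptions made for the sketch, and the simulation does not reproduce the paper's bounds or its stated conditions.

```python
import random
from collections import Counter


def majority_vote(votes):
    """Return the most frequent label; ties go to the smallest label (an assumption)."""
    if not votes:
        raise ValueError("votes must be non-empty")
    counts = Counter(votes)
    top = max(counts.values())
    return min(label for label, count in counts.items() if count == top)


def simulate_error_rate(num_voters, probs, true_label=0, trials=2000, seed=0):
    """Estimate the MVF error rate for i.i.d. voters.

    probs[j] is the probability that a single voter reports class j,
    given that the true class is true_label.
    """
    rng = random.Random(seed)
    labels = list(range(len(probs)))
    errors = 0
    for _ in range(trials):
        votes = rng.choices(labels, weights=probs, k=num_voters)
        if majority_vote(votes) != true_label:
            errors += 1
    return errors / trials


if __name__ == "__main__":
    # Three classes; each voter picks the true class (0) with probability 0.6.
    for n in (1, 5, 21, 51):
        print(n, simulate_error_rate(n, [0.6, 0.2, 0.2]))
```

In this particular setup the true class is each voter's single most likely report, and the estimated error rate shrinks as voters are added.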

IR · May 31, 2021 · Code
A Bytecode-based Approach for Smart Contract Classification

Chaochen Shi, Yong Xiang, Robin Ram Mohan Doss et al.

With the development of blockchain technologies, the number of smart contracts deployed on blockchain platforms is growing exponentially, which makes it difficult for users to find desired services by manual screening. The automatic classification of smart contracts can provide blockchain users with keyword-based contract searching and helps to manage smart contracts effectively. Current research on smart contract classification focuses on Natural Language Processing (NLP) solutions which are based on contract source code. However, more than 94% of smart contracts are not open-source, so the application scenarios of NLP methods are very limited. Meanwhile, NLP models are vulnerable to adversarial attacks. This paper proposes a classification model based on features from contract bytecode instead of source code to solve these problems. We also use feature selection and ensemble learning to optimize the model. Our experimental studies on over 3,300 real-world Ethereum smart contracts show that our model can classify smart contracts without source code and has better performance than baseline models. Our model also has good resistance to adversarial attacks compared with NLP-based models. In addition, our analysis reveals that account features used in many smart contract classification models have little effect on classification and can be excluded.

CR · Dec 11, 2020 · Code
SoK: Diving into DAG-based Blockchain Systems

Qin Wang, Jiangshan Yu, Shiping Chen et al.

Blockchain plays an important role in cryptocurrency markets and technology services. However, high latency and low scalability in classic designs hinder its adoption and application. Reconstructed blockchain systems have been proposed to avoid the consumption of competitive transactions caused by linear sequenced blocks. These systems, instead, structure transactions/blocks in the form of a Directed Acyclic Graph (DAG) and consequently rebuild upper-layer components including consensus, incentives, etc. The promise of DAG-based blockchain systems is to enable fast confirmation (complete transactions within milliseconds) and high scalability (attach transactions in parallel) without significantly compromising security. However, this field still lacks systematic work that summarises the DAG technique. To bridge the gap, this Systematization of Knowledge (SoK) provides a comprehensive analysis of DAG-based blockchain systems. Through deconstructing open-sourced systems and reviewing academic research, we summarise the main components and featured properties of systems, and provide the approach to establish a DAG. With this in hand, we analyze the security and performance of several leading systems, followed by discussions and comparisons with concurrent (scaling blockchain) techniques. We further identify open challenges to highlight the potential of DAG-based solutions and indicate their promising directions for future research.

CR · Feb 1
TxRay: Agentic Postmortem of Live Blockchain Attacks

Ziyue Wang, Jiangshan Yu, Kaihua Qin et al.

Decentralized Finance (DeFi) has turned blockchains into financial infrastructure, allowing anyone to trade, lend, and build protocols without intermediaries, but this openness exposes pools of value controlled by code. Within five years, the DeFi ecosystem has lost over 15.75B USD to reported exploits. Many exploits arise from permissionless opportunities that any participant can trigger using only public state and standard interfaces, which we call Anyone-Can-Take (ACT) opportunities. Despite on-chain transparency, postmortem analysis remains slow and manual: investigations start from limited evidence, sometimes only a single transaction hash, and must reconstruct the exploit lifecycle by recovering related transactions, contract code, and state dependencies. We present TxRay, a Large Language Model (LLM) agentic postmortem system that uses tool calls to reconstruct live ACT attacks from limited evidence. Starting from one or more seed transactions, TxRay recovers the exploit lifecycle, derives an evidence-backed root cause, and generates a runnable, self-contained Proof of Concept (PoC) that deterministically reproduces the incident. TxRay self-checks postmortems by encoding incident-specific semantic oracles as executable assertions. To evaluate PoC correctness and quality, we develop PoCEvaluator, an independent agentic execution-and-review evaluator. On 114 incidents from DeFiHackLabs, TxRay produces an expert-aligned root cause and an executable PoC for 105 incidents, achieving 92.11% end-to-end reproduction. Under PoCEvaluator, 98.1% of TxRay PoCs avoid hard-coding attacker addresses, a +24.8pp lift over DeFiHackLabs. In a live deployment, TxRay delivers validated root causes in 40 minutes and PoCs in 59 minutes at median latency. TxRay's oracle-validated PoCs enable attack imitation, improving coverage by 15.6% and 65.5% over STING and APE.

CR · Jan 30, 2025
Large Language Models for Cryptocurrency Transaction Analysis: A Bitcoin Case Study

Yuchen Lei, Yuexin Xiang, Qin Wang et al.

Cryptocurrencies are widely used, yet current methods for analyzing transactions often rely on opaque, black-box models. While these models may achieve high performance, their outputs are usually difficult to interpret and adapt, making it challenging to capture nuanced behavioral patterns. Large language models (LLMs) have the potential to address these gaps, but their capabilities in this area remain largely unexplored, particularly in cybercrime detection. In this paper, we test this hypothesis by applying LLMs to real-world cryptocurrency transaction graphs, with a focus on Bitcoin, one of the most studied and widely adopted blockchain networks. We introduce a three-tiered framework to assess LLM capabilities: foundational metrics, characteristic overview, and contextual interpretation. This includes a new, human-readable graph representation format, LLM4TG, and a connectivity-enhanced transaction graph sampling algorithm, CETraS. Together, they significantly reduce token requirements, transforming the analysis of multiple moderately large-scale transaction graphs with LLMs from nearly impossible to feasible under strict token limits. Experimental results demonstrate that LLMs have outstanding performance on foundational metrics and characteristic overview, where the accuracy of recognizing most basic information at the node level exceeds 98.50% and the proportion of obtaining meaningful characteristics reaches 95.00%. Regarding contextual interpretation, LLMs also demonstrate strong performance in classification tasks, even with very limited labeled data, where top-3 accuracy reaches 72.43% with explanations. While the explanations are not always fully accurate, they highlight the strong potential of LLMs in this domain. At the same time, several limitations persist, which we discuss along with directions for future research.

SE · Nov 28, 2021
Semantic Code Search for Smart Contracts

Chaochen Shi, Yong Xiang, Jiangshan Yu et al.

Semantic code search technology allows searching for existing code snippets through natural language, which can greatly improve programming efficiency. Smart contracts, programs that run on the blockchain, have a code reuse rate of more than 90%, which means developers have a great demand for semantic code search tools. However, the existing code search models still have a semantic gap between code and query, and perform poorly on specialized queries of smart contracts. In this paper, we propose a Multi-Modal Smart contract Code Search (MM-SCS) model. Specifically, we construct a Contract Elements Dependency Graph (CEDG) for MM-SCS as an additional modality to capture the data-flow and control-flow information of the code. To make the model more focused on the key contextual information, we use a multi-head attention network to generate embeddings for code features. In addition, we use a fine-tuned pretrained model to ensure the model's effectiveness when the training data is small. We compared MM-SCS with four state-of-the-art models on a dataset with 470K (code, docstring) pairs collected from Github and Etherscan. Experimental results show that MM-SCS achieves an MRR (Mean Reciprocal Rank) of 0.572, outperforming four state-of-the-art models UNIF, DeepCS, CARLCS-CNN, and TAB-CS by 34.2%, 59.3%, 36.8%, and 14.1%, respectively. Additionally, the search speed of MM-SCS is second only to UNIF, reaching 0.34s/query.
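
For readers unfamiliar with the metric reported above, Mean Reciprocal Rank averages 1/rank of the first relevant result over all queries, with a query contributing 0 when no relevant result is returned. The sketch below follows that standard definition; the function name and the use of `None` for a miss are illustrative choices, not taken from the paper.

```python
def mean_reciprocal_rank(first_relevant_ranks):
    """Compute MRR from the 1-based rank of the first relevant result per query.

    Use None for a query whose results contain no relevant item.
    """
    if not first_relevant_ranks:
        raise ValueError("need at least one query")
    total = 0.0
    for rank in first_relevant_ranks:
        if rank is None:
            continue  # a miss contributes 0 to the sum
        if rank < 1:
            raise ValueError("ranks are 1-based")
        total += 1.0 / rank
    return total / len(first_relevant_ranks)


if __name__ == "__main__":
    # Three queries whose correct snippets were ranked 1st, 2nd and 4th.
    print(mean_reciprocal_rank([1, 2, 4]))
```

An MRR of 0.572 therefore means that, averaged over queries, the correct snippet tends to sit between the first and second positions.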

SE · Jun 11, 2021
Low-level Comments auto-generation for Solidity Smart Contracts

Chaochen Shi, Yong Xiang, Jiangshan Yu et al.

Context: Decentralized applications on blockchain platforms are realized through smart contracts. However, participants who lack programming knowledge often have difficulty reading smart contract source code, which leads to potential security risks and barriers to participation. Objective: Our objective is to translate smart contract source code into natural language descriptions to help people better understand, operate, and learn smart contracts. Method: This paper proposes an automated translation tool for Solidity smart contracts, termed SolcTrans, based on an abstract syntax tree and formal grammar. We have investigated 3,000 smart contracts and determined the parts of speech of corresponding blockchain terms. Among them, we further filtered out contract snippets without detailed comments and left 811 snippets to evaluate the translation quality of SolcTrans. Results: Experimental results show that even with a small corpus, SolcTrans can achieve similar performance to the state-of-the-art code comment generation models for other programming languages. In addition, SolcTrans has consistent performance when dealing with code snippets with different lengths and gas consumption. Conclusion: SolcTrans can correctly interpret Solidity code and automatically convert it into comprehensible English text. We will release our tool and dataset for supporting reproduction and further studies in related fields.

CR · Dec 14, 2020
Verifiable Observation of Permissioned Ledgers

Ermyas Abebe, Yining Hu, Allison Irvin et al.

Permissioned ledger technologies have gained significant traction over the last few years. For practical reasons, their applications have focused on transforming narrowly scoped use-cases in isolation. This has led to a proliferation of niche, isolated networks that are quickly becoming data and value silos. To increase value across the broader ecosystem, these networks must seamlessly integrate with existing systems and interoperate with one another. A fundamental requirement for enabling crosschain communication is the ability to prove the validity of the internal state of a ledger to an external party. However, due to the closed nature of permissioned ledgers, their internal state is opaque to an external observer. This makes consuming and verifying states from these networks a non-trivial problem. This paper addresses this fundamental requirement for state sharing across permissioned ledgers. In particular, we address two key problems for external clients: (i) assurances on the validity of state in a permissioned ledger and (ii) the ability to reason about the currency of state. We assume an adversarial model where the members of the committee managing the permissioned ledger can be malicious in the absence of detectability and accountability. We present a formalization of the problem for state sharing and examine its security properties under different adversarial conditions. We propose the design of a protocol that uses a secure public ledger for providing guarantees on safety and the ability to reason about time, with at least one honest member in the committee. We then provide a formal security analysis of our design and a proof of concept implementation based on Hyperledger Fabric demonstrating the effectiveness of the proposed protocol.

CR · Sep 1, 2020
Characterizing Erasable Accounts in Ethereum

Xiaoqi Li, Ting Chen, Xiapu Luo et al.

Being the most popular permissionless blockchain that supports smart contracts, Ethereum allows any user to create accounts on it. However, not all accounts matter. For example, accounts that result from attacks can be removed. In this paper, we conduct the first investigation on erasable accounts that can be removed to save system resources and even users' money (i.e., ETH or gas). In particular, we propose and develop a novel tool named GLASER, which analyzes the State DataBase of Ethereum to discover five kinds of erasable accounts. The experimental results show that GLASER can accurately reveal 508,482 erasable accounts and that these accounts lead to users wasting more than 106 million dollars. GLASER can help stop further economic loss caused by these detected accounts. Moreover, GLASER characterizes the attacks/behaviors related to detected erasable accounts through graph analysis.

CR · Jul 18, 2020
How to Democratise and Protect AI: Fair and Differentially Private Decentralised Deep Learning

Lingjuan Lyu, Yitong Li, Karthik Nandakumar et al.

This paper first considers the research problem of fairness in collaborative deep learning, while ensuring privacy. A novel reputation system is proposed through digital tokens and local credibility to ensure fairness, in combination with differential privacy to guarantee privacy. In particular, we build a fair and differentially private decentralised deep learning framework called FDPDDL, which enables parties to derive more accurate local models in a fair and private manner by using our developed two-stage scheme: during the initialisation stage, artificial samples generated by Differentially Private Generative Adversarial Network (DPGAN) are used to mutually benchmark the local credibility of each party and generate initial tokens; during the update stage, Differentially Private SGD (DPSGD) is used to facilitate collaborative privacy-preserving deep learning, and local credibility and tokens of each party are updated according to the quality and quantity of individually released gradients. Experimental results on benchmark datasets under three realistic settings demonstrate that FDPDDL achieves high fairness, yields comparable accuracy to the centralised and distributed frameworks, and delivers better accuracy than the standalone framework.

CR · Aug 22, 2019
Security Analysis Methods on Ethereum Smart Contract Vulnerabilities: A Survey

Purathani Praitheeshan, Lei Pan, Jiangshan Yu et al.

Smart contracts are software programs featuring both traditional applications and distributed data storage on blockchains. Ethereum is a prominent blockchain platform with the support of smart contracts. The smart contracts act as autonomous agents in critical decentralized applications and hold a significant amount of cryptocurrency to perform trusted transactions and agreements. Millions of dollars as part of the assets held by the smart contracts were stolen or frozen through the notorious attacks just between 2016 and 2018, such as the DAO attack, Parity Multi-Sig Wallet attack, and the integer underflow/overflow attacks. These attacks were caused by a combination of technical flaws in designing and implementing software code. However, many more vulnerabilities of less severity are to be discovered because of the scripting nature of the Solidity language and the non-updateable feature of blockchains. Hence, we surveyed 16 security vulnerabilities in smart contract programs, some of which do not have a proper solution. This survey aims to identify the key vulnerabilities in smart contracts on Ethereum from the perspectives of their internal mechanisms and software security vulnerabilities. By correlating 16 Ethereum vulnerabilities and 19 software security issues, we predict that many attacks are yet to be exploited. We have also explored many software tools to detect the security vulnerabilities of smart contracts in terms of static analysis, dynamic analysis, and formal verification. This survey presents the security problems in smart contracts together with the available analysis tools and the detection methods. We also investigated the limitations of the tools or analysis methods with respect to the identified security vulnerabilities of the smart contracts.

CR · Jun 4, 2019
Towards Fair and Privacy-Preserving Federated Deep Models

Lingjuan Lyu, Jiangshan Yu, Karthik Nandakumar et al.

The current standalone deep learning framework tends to result in overfitting and low utility. This problem can be addressed by either a centralized framework that deploys a central server to train a global model on the joint data from all parties, or a distributed framework that leverages a parameter server to aggregate local model updates. Server-based solutions are prone to the problem of a single point of failure. In this respect, collaborative learning frameworks, such as federated learning (FL), are more robust. Existing federated learning frameworks overlook an important aspect of participation: fairness. All parties are given the same final model without regard to their contributions. To address these issues, we propose a decentralized Fair and Privacy-Preserving Deep Learning (FPPDL) framework to incorporate fairness into federated deep learning models. In particular, we design a local credibility mutual evaluation mechanism to guarantee fairness, and a three-layer onion-style encryption scheme to guarantee both accuracy and privacy. Unlike the existing FL paradigm, under FPPDL, each participant receives a different version of the FL model with performance commensurate with their contributions. Experiments on benchmark datasets demonstrate that FPPDL balances fairness, privacy and accuracy. It enables federated learning ecosystems to detect and isolate low-contribution parties, thereby promoting responsible participation.

NI · Nov 9, 2017
ANCHOR: logically-centralized security for Software-Defined Networks

Diego Kreutz, Jiangshan Yu, Fernando M. V. Ramos et al.

While the centralization of SDN brought advantages such as a faster pace of innovation, it also disrupted some of the natural defenses of traditional architectures against different threats. The literature on SDN has mostly been concerned with the functional side, despite some specific works concerning non-functional properties like 'security' or 'dependability'. Though addressing the latter in an ad-hoc, piecemeal way may work, it will most likely lead to efficiency and effectiveness problems. We claim that the enforcement of non-functional properties as a pillar of SDN robustness calls for a systemic approach. As a general concept, we propose ANCHOR, a subsystem architecture that promotes the logical centralization of non-functional properties. To show the effectiveness of the concept, we focus on 'security' in this paper: we identify the current security gaps in SDNs and we populate the architecture middleware with the appropriate security mechanisms, in a global and consistent manner. Essential security mechanisms provided by ANCHOR include reliable entropy and resilient pseudo-random generators, and protocols for secure registration and association of SDN devices. We claim and justify in the paper that centralizing such mechanisms is key for their effectiveness, by allowing us to: define and enforce global policies for those properties; reduce the complexity of controllers and forwarding devices; ensure higher levels of robustness for critical services; foster interoperability of the non-functional property enforcement mechanisms; and promote the security and resilience of the architecture itself. We discuss design and implementation aspects, and we prove and evaluate our algorithms and mechanisms, including the formalisation of the main protocols and the verification of their core security properties using the Tamarin prover.

CR · Aug 5, 2014
DTKI: a new formalized PKI with no trusted parties

Jiangshan Yu, Vincent Cheval, Mark Ryan

The security of public key validation protocols for web-based applications has recently attracted attention because of weaknesses in the certificate authority model, and consequent attacks. Recent proposals using public logs have succeeded in making certificate management more transparent and verifiable. However, those proposals involve a fixed set of authorities. This means an oligopoly is created. Another problem with current log-based systems is their heavy reliance on trusted parties that monitor the logs. We propose a distributed transparent key infrastructure (DTKI), which greatly reduces the oligopoly of service providers and allows verification of the behaviour of trusted parties. In addition, this paper formalises the public log data structure and provides a formal analysis of the security that DTKI guarantees.