Arthur Gervais

CR
h-index: 20
32 papers
2,568 citations
Novelty: 51%
AI Score: 58

32 Papers

86.2 SE May 27
SCDBench: A Benchmark for LLM-Based Smart Contract Decompilers

Kaihua Qin, Dawn Song, Arthur Gervais

Smart contract decompilation aims to recover high-level source code from bytecode, but evaluating decompilers remains difficult because existing studies use narrow datasets, inconsistent metrics, and limited semantic consistency checks. This gap is increasingly important as large language models (LLMs) begin to generate source-like Solidity that may compile and appear plausible, even when its semantics diverge from the original contract. We introduce SCDBench, a dataset and benchmark methodology for LLM-based smart contract decompilation. The dataset contains 600 real-world Solidity contracts with paired bytecode inputs, ground-truth source code, and replayable semantic checkpoints. SCDBench evaluates decompiler outputs through four cumulative stages: format completeness, compilability, Application Binary Interface (ABI) recovery, and semantic consistency via differential replay. We evaluate Claude Opus 4.7, GPT-5.3-Codex, and GLM-5 in a zero-shot decompilation setting, including GLM-5 variants with and without extended reasoning and a zero-shot compilation-repair setting. The results show that frontier LLMs can often produce structured and compilable Solidity, but achieving semantic consistency remains far from solved: the best-performing frontier model perfectly decompiles only 42/600 contracts. We further show that introducing same-model compilation repair substantially improves performance at modest additional cost. SCDBench establishes a common ground for rigorous, reproducible evaluation and aims to accelerate the development of reliable smart contract decompilers for blockchain security and transparency.
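The four cumulative stages described above can be sketched as a simple gate: an output is only tested at a later stage if it passed every earlier one. This is a minimal illustration, not SCDBench's actual harness. The stage names follow the abstract, while `highest_stage_passed` and the toy string checks are hypothetical stand-ins for real compilation, ABI comparison, and differential replay.

```python
from typing import Callable, Dict, Optional

# Stage order taken from the abstract; the short names are illustrative.
STAGES = ["format", "compile", "abi", "semantic"]


def highest_stage_passed(
    output: str, checks: Dict[str, Callable[[str], bool]]
) -> Optional[str]:
    """Return the last stage passed before the first failure, or None."""
    passed = None
    for stage in STAGES:
        if not checks[stage](output):
            break  # cumulative: later stages are not attempted
        passed = stage
    return passed


# Toy checks (assumptions for illustration only, not real validators).
toy_checks = {
    "format": lambda s: s.strip() != "",
    "compile": lambda s: "contract" in s,
    "abi": lambda s: "function" in s,
    "semantic": lambda s: "transfer" in s,
}

empty_result = highest_stage_passed("", toy_checks)
partial_result = highest_stage_passed("contract C {}", toy_checks)
full_result = highest_stage_passed(
    "contract C { function transfer() public {} }", toy_checks
)
```

Under this reading, a contract counts as fully decompiled only when the returned stage is the final one.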

CR Apr 25, 2023
Blockchain Large Language Models

Yu Gai, Liyi Zhou, Kaihua Qin et al.

This paper presents a dynamic, real-time approach to detecting anomalous blockchain transactions. The proposed tool, BlockGPT, generates tracing representations of blockchain activity and trains a large language model from scratch to act as a real-time Intrusion Detection System. Unlike traditional methods, BlockGPT is designed to offer an unrestricted search space and does not rely on predefined rules or patterns, enabling it to detect a broader range of anomalies. We demonstrate the effectiveness of BlockGPT through its use as an anomaly detection tool for Ethereum transactions. In our experiments, it effectively identifies abnormal transactions among a dataset of 68M transactions and has a batched throughput of 2284 transactions per second on average. Our results show that BlockGPT identifies abnormal transactions by ranking 49 out of 124 attacks among the top-3 most abnormal transactions interacting with their victim contracts. This work makes contributions to the field of blockchain transaction analysis by introducing a custom data encoding compatible with the transformer architecture, a domain-specific tokenization technique, and a tree encoding method specifically crafted for the Ethereum Virtual Machine (EVM) trace representation.

91.2 GT May 1
Your Loss is My Gain: Low Stake Attacks on Liquid Staking Pools

Sen Yang, Aviv Yaish, Arthur Gervais et al.

Permissionless Proof-of-Stake (PoS) economic security is predicated on the high cost of violating consensus safety or liveness. We show that liquid staking introduces additional risks that are not captured by standard PoS economic security arguments. Through an empirical study of Ethereum data, we find that the operational performance of liquid staking pools is positively associated with subsequent normalized liquid staking token (LST) returns. Motivated by this, we present a cross-layer attack: a low-stake adversary can manipulate the consensus protocol to degrade a target pool's performance and take application-layer positions that profit if the market reprices the corresponding LST in line with the historically observed association. To make the consensus layer manipulation concrete, we develop a deep reinforcement learning (DRL) framework to automatically discover attack strategies. Our evaluation shows that the learned strategies can recover near-optimal theoretical attacks and uncover new manipulation behaviors that significantly degrade target pool performance. We further characterize feasible application-layer monetization channels and analyze leveraged shorting in detail using Monte Carlo simulations, showing that such attacks can be profitable with over one-half probability for LSTs of major staking pools. Our findings reveal a previously overlooked attack surface in PoS systems with liquid staking and expose a gap between consensus and economic security.

ST Feb 20, 2023
Exploring the Advantages of Transformers for High-Frequency Trading

Fazl Barez, Paul Bilokon, Arthur Gervais et al.

This paper explores novel deep learning Transformer architectures for high-frequency Bitcoin-USDT log-return forecasting and compares them to the traditional Long Short-Term Memory models. A hybrid Transformer model, called HFformer, is then introduced for time series forecasting; it incorporates a Transformer encoder, linear decoder, spiking activations, and quantile loss function, and does not use position encoding. Furthermore, possible high-frequency trading strategies for use with the HFformer model are discussed, including trade sizing, trading signal aggregation, and minimal trading threshold. Ultimately, the performance of the HFformer and Long Short-Term Memory models is assessed, and results indicate that the HFformer achieves a higher cumulative PnL than the LSTM when trading with multiple signals during backtesting.
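For readers unfamiliar with the quantile loss mentioned above, it is commonly written as the pinball loss, which penalizes under- and over-prediction asymmetrically. The sketch below uses that standard form. The abstract does not state the exact variant or the quantiles HFformer uses, so the function name `quantile_loss` and the choice `q = 0.9` are assumptions for illustration.

```python
from typing import Sequence


def quantile_loss(y_true: Sequence[float], y_pred: Sequence[float], q: float) -> float:
    """Mean pinball loss for quantile level q in (0, 1)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        err = y - p
        # Under-prediction (err > 0) costs q * err;
        # over-prediction (err < 0) costs (1 - q) * |err|.
        total += max(q * err, (q - 1.0) * err)
    return total / len(y_true)


# With q = 0.9 (assumed), under-predicting by 1 costs far more
# than over-predicting by 1.
under = quantile_loss([1.0], [0.0], q=0.9)
over = quantile_loss([0.0], [1.0], q=0.9)
```

At `q = 0.5` the loss reduces to half the mean absolute error, which is why it is often described as a generalization of MAE.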

77.3 CR May 19
Measuring Safety Alignment Effects in Autonomous Security Agents

Isaac David, Arthur Gervais

Do stock safety-aligned language models and their uncensored or abliterated derivatives behave differently when run as autonomous security agents? Single-turn refusal benchmarks cannot answer this question: security agents must inspect repositories, call tools, and produce vulnerability evidence inside authorized sandboxes. We present a trace-based benchmark of 30 local vulnerability-analysis tasks with fixed tools, deterministic success predicates, redaction rules, and grounding checks, and compare four stock models against uncensored or abliterated derivatives: Gemma 4 31B, Gemma 4 26B A4B, Qwen2.5-Coder 7B, and Llama 3.1 8B. The artifact contains 1,500 security-agent traces and 800 non-security control traces. The Gemma pairs show large less-restricted gains on security tasks: 14.0% versus 0.7% success for 31B and 10.7% versus 0.0% for 26B, with higher mean grounding (3.91 versus 3.27 and 4.12 versus 1.64 out of five) and 0.0% refusal, suppressed-action, and unsafe-action rates in the 31B traces. However, controls and non-Gemma pairs rule out a clean security-specific or universal less-restricted effect: Gemma gaps also appear on ordinary coding tasks, Qwen2.5-Coder success is lower for the less-restricted derivative (2.0% versus 5.3%), and the abliterated Llama derivative fails the tool protocol. Across all families, hard proof-of-trigger and patch-verification tasks remain unsolved. These results show that safety alignment effects in autonomous security agents should be measured at the system level, separating refusal, unsafe action, tool reliability, and evidence grounding rather than treating refusal rate as the safety signal.

77.9 CR Apr 20
Towards Optimal Agentic Architectures for Offensive Security Tasks

Isaac David, Arthur Gervais

Agentic security systems increasingly audit live targets with tool-using LLMs, but prior systems fix a single coordination topology, leaving unclear when additional agents help and when they only add cost. We treat topology choice as an empirical systems question. We introduce a controlled benchmark of 20 interactive targets (10 web/API and 10 binary), each exposing one endpoint-reachable ground-truth vulnerability, evaluated in whitebox and blackbox modes. The core study executes 600 runs over five architecture families, three model families, and both access modes, with a separate 60-run long-context pilot reported only in the appendix. On the completed core benchmark, detection-any reaches 58.0% and validated detection reaches 49.8%. MAS-Indep attains the highest validated detection rate (64.2%), while SAS is the strongest efficiency baseline at $0.058 per validated finding. Whitebox materially outperforms blackbox (67.0% vs. 32.7% validated detection), and web materially outperforms binary (74.3% vs. 25.3%). Bootstrap confidence intervals and paired target-level deltas show that the dominant effects are observability and domain, while some leading whitebox topologies remain statistically close. The main result is a non-monotonic cost-quality frontier: broader coordination can improve coverage, but it does not dominate once latency, token cost, and exploit-validation difficulty are taken into account.

73.3 CR May 17
Ablating Safety: Mechanisms for Removing Alignment in Language Models for Security Applications

Isaac David, Arthur Gervais

Safety-aligned language models often refuse cybersecurity requests whose wording resembles misuse, even when the task is authorized and defensive. This makes security evaluation ambiguous: a failed answer may reflect missing capability or refusal-policy intervention. Ablating Safety studies alignment removal as a controlled transformation-evaluation protocol for authorized security tasks, comparing authorized-context prompting, reversible refusal-direction activation projection, representation-control projections, and LoRA-based de-alignment or task adaptation. We evaluate refusal, attempt rate, validated security success, general-capability retention, instability, and out-of-scope unsafe compliance on Security-AR, a 60-prompt suite of authorized security, benign general, and non-operational spillover probes. The reported runs include a four-model projection pilot with 416 completions, a three-model Qwen2.5 LoRA extension with 1,980 held-out completions, representation and robustness sweeps, and executable secure-repair validators. Single-vector refusal projection raises mean security score only from 0.46 to 0.50 while increasing unsafe compliance from 0.10 to 0.47; rank-4 refusal-subspace projection reaches 0.51 while matching the aligned spillover rate. Task-only LoRA raises mean security score to 0.87 with general score 0.83 and unsafe compliance 0.13, while refusal-suppression with retention raises spillover to 0.27. These results support evaluating alignment removal as a utility-risk frontier, not as an uncensoring recipe, and treating compliance alone as neither competence nor safe deployment.

72.9 SE May 17
Benchmarking Mythos-Linked Bug Rediscovery

Isaac David, Arthur Gervais

Anthropic's April 2026 Mythos materials combine benchmark claims with concrete bug-finding stories across OpenBSD, FreeBSD, Linux, FFmpeg, and browsers. This paper reports a controlled target-file rediscovery experiment on six public or high-confidence Mythos-linked systems tasks. Each model receives the same target file or files, read-only source tools, three repeats per task, and one manual target-matching rubric; prompts omit CVE identifiers, patch hashes, advisory text, author names, disclosure dates, and answer key root cause language. The experiment contains 54 counted model-task attempts: three models, six tasks, and three repeats, giving 18 attempts per model. GPT-5.5 xhigh achieves 5/18 target rediscoveries, covering 2/6 tasks; counting one wrong-target mpegts.c finding separately gives 3/6 distinct core bugs. Claude Opus 4.7 achieves 1/18 target rediscoveries, covering 1/6 tasks. Kimi K2 records 0/18 target rediscoveries. The dominant failure mode is early commitment to plausible alternate candidates within the assigned file: models often submit source-grounded hypotheses while missing the specific invariant corrected by public Mythos patch evidence. These results do not refute Anthropic's undisclosed workflow, but show that under this favorable target-file scaffold, systems-specific prompting yields only six target matches across 54 counted attempts.

CR Aug 28, 2025 Code
Multi-Agent Penetration Testing AI for the Web

Isaac David, Arthur Gervais

AI-powered development platforms are making software creation accessible to a broader audience, but this democratization has triggered a scalability crisis in security auditing. With studies showing that up to 40% of AI-generated code contains vulnerabilities, the pace of development now vastly outstrips the capacity for thorough security assessment. We present MAPTA, a multi-agent system for autonomous web application security assessment that combines large language model orchestration with tool-grounded execution and end-to-end exploit validation. On the 104-challenge XBOW benchmark, MAPTA achieves 76.9% overall success with perfect performance on SSRF and misconfiguration vulnerabilities, 83% success on broken authorization, and strong results on injection attacks including server-side template injection (85%) and SQL injection (83%). Cross-site scripting (57%) and blind SQL injection (0%) remain challenging. Our comprehensive cost analysis across all challenges totals $21.38 with a median cost of $0.073 for successful attempts versus $0.357 for failures. Success correlates strongly with resource efficiency, enabling practical early-stopping thresholds at approximately 40 tool calls or $0.30 per challenge. MAPTA's real-world findings are impactful given both the popularity of the respective scanned GitHub repositories (8K-70K stars) and MAPTA's low average operating cost of $3.67 per open-source assessment: MAPTA discovered critical vulnerabilities including RCEs, command injections, secret exposure, and arbitrary file write vulnerabilities. Findings are responsibly disclosed; 10 findings are under CVE review.

80.9 SE May 11
CrackMeBench: Binary Reverse Engineering for Agents

Isaac David, Arthur Gervais

Benchmarks for coding agents increasingly measure source-level software repair, and cybersecurity benchmarks increasingly measure broad capture-the-flag performance. Classical binary reverse engineering remains less precisely specified: given only an executable, can an agent recover validation logic and produce an input, serial, artifact, or key generator accepted by the program? We introduce CrackMeBench, a benchmark for evaluating language-model agents on educational CrackMe-style reverse-engineering tasks. CrackMeBench focuses on deterministic binary validation problems with executable oracles, symbol-poor binaries, explicit local tool access, and externally scored submissions rather than free-form explanations. The v0 benchmark combines eight public calibration CrackMes with twelve generated main-score tasks built from seeded C, Rust, and Go templates, and agents run through an equal shell interface in a no-network Linux Docker sandbox with standard reverse-engineering tools. In a three-model evaluation with a five-minute budget and three scored submissions per task, pass@3 on the generated split is 11/12 tasks (92%) for GPT-5.5, 7/12 (58%) for Claude Opus 4.7, and 5/12 (42%) for Kimi K2. The harder generated half separates the models more sharply, with pass@3 of 5/6, 2/6, and 1/6, respectively; on the eight-task public calibration split, pass@3 is 3/8, 2/8, and 1/8. CrackMeBench records pass@1 and pass@3, scored submissions, wall-clock time, command traces, tool categories, provider-reported token usage, estimated cost, and qualitative failure labels, providing a reproducible testbed for measuring progress from source-code reasoning toward autonomous binary analysis while restricting scope to educational, purpose-built programs.
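One plausible reading of the pass@1 and pass@3 counts above is "the task is solved if any of the first k scored submissions is accepted". The sketch below implements that empirical count. It is an assumption about how the benchmark tallies results, not its actual scoring code, and the task names and outcomes are made up for illustration.

```python
from typing import Dict, List


def pass_at_k(results: Dict[str, List[bool]], k: int) -> int:
    """Count tasks with at least one accepted submission among the first k."""
    return sum(1 for outcomes in results.values() if any(outcomes[:k]))


# Hypothetical per-task outcomes, in submission order.
toy_results = {
    "task_a": [True],                 # solved on the first submission
    "task_b": [False, False, True],   # solved on the third submission
    "task_c": [False, False, False],  # never solved
}

solved_at_1 = pass_at_k(toy_results, k=1)
solved_at_3 = pass_at_k(toy_results, k=3)
```

This differs from the sampling-based pass@k estimator used in some code-generation benchmarks, which averages over many independent samples rather than counting a fixed submission budget.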

59.7 CR May 7
Patch2Vuln: Agentic Reconstruction of Vulnerabilities from Linux Distribution Binary Patches

Isaac David, Arthur Gervais

Security updates create a short but important window in which defenders and attackers can compare vulnerable and patched software. Yet in many operational settings, the most accessible artifacts are binary packages rather than source patches or advisory text. This paper asks whether a language-model agent, restricted to local binary-derived evidence, can reconstruct the security meaning of Linux distribution updates. Patch2Vuln is a local, resumable pipeline that extracts old/new ELF pairs, diffs them with Ghidra and Ghidriff, ranks changed functions, builds candidate dossiers, and asks an offline agent to produce a preliminary audit, bounded validation plan, and final audit. We evaluate Patch2Vuln on 25 Ubuntu `.deb` package pairs: 20 security-update pairs and five negative controls, all manually adjudicated against private source-patch and binary-function ground truth. The agent localizes a verified security-relevant patch function in 10 of 20 security pairs and assigns an accepted final root-cause class in 11 of 20. Oracle diagnostics show that six security pairs fail before model reasoning because the binary differ or ranker omits the right function, with one additional context-export miss. A separate bounded validation pass produces two target-level minimized behavioral old/new differentials, both for tcpdump, but no crash, timeout, sanitizer finding, or memory-corruption proof; all five negative controls are classified as unknown and produce no validation differentials. These results support agentic vulnerability reconstruction from binary patches as a useful research target while showing that binary-diff coverage and local behavioral validation remain the limiting components.

89.0 CR Apr 30
Alignment Contracts for Agentic Security Systems

Isaac David, Marco Guarnieri, Arthur Gervais

Agentic security systems increasingly combine LLM planners with tools that can discover, validate, and report vulnerabilities. This creates an asymmetric control problem: the system should retain strong offensive capability inside an authorized engagement, while the same capabilities must be denied outside scope. Existing guardrails provide useful policy controls, but they do not make this boundary a first-class formal contract over observable effects. We introduce alignment contracts, a framework for specifying and enforcing behavioral constraints over observable effect traces. A contract defines scope, allowed and forbidden effects, resource budgets, and disclosure policies. We give the language finite-trace semantics, characterize satisfaction as a safety property with finite violation witnesses, develop refinement and one-way composition rules for modular contract engineering, and show that admissibility checking is decidable. We instantiate the framework for web-focused agentic security workflows and show how the same structure extends to other effect profiles. Under an explicit Effect Observability Assumption, where all $\Sigma_{\mathrm{Eff}}$-effects are mediated, the soundness theorem quantifies over the agent model and gives guarantees for mediated $\Sigma_{\mathrm{Eff}}$-effects, including enforcement soundness for monitor-realized traces. We also state an assumption-lifted adaptation result and formalize limits through undecidability transfer and observability-boundary theorems. A Lean 4 artifact checks the formal core theorems used by the paper.
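To make the idea of a safety property with finite violation witnesses concrete, here is a toy monitor over a finite effect trace. It is a loose sketch, not the paper's contract language: the `Effect` and `Contract` types, their fields, and `first_violation` are hypothetical, and only scope, forbidden effects, and a resource budget are modeled (disclosure policies are omitted).

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Effect:
    kind: str      # e.g. "http_request", "file_write" (illustrative labels)
    target: str    # e.g. a host name
    cost: int = 1  # abstract resource units


@dataclass(frozen=True)
class Contract:
    scope: FrozenSet[str]            # targets the engagement authorizes
    forbidden_kinds: FrozenSet[str]  # effect kinds that are never allowed
    budget: int                      # total resource units permitted


def first_violation(contract: Contract, trace: List[Effect]) -> Optional[int]:
    """Return the index of the first violating effect, or None if the trace satisfies the contract."""
    spent = 0
    for i, eff in enumerate(trace):
        spent += eff.cost
        if (
            eff.target not in contract.scope
            or eff.kind in contract.forbidden_kinds
            or spent > contract.budget
        ):
            return i  # the prefix trace[: i + 1] is a finite witness
    return None


contract = Contract(
    scope=frozenset({"app.test"}),
    forbidden_kinds=frozenset({"file_write"}),
    budget=3,
)
ok_trace = [Effect("http_request", "app.test")] * 3
out_of_scope = [Effect("http_request", "app.test"), Effect("http_request", "evil.test")]
over_budget = [Effect("http_request", "app.test")] * 4
```

Because a violation is always detected at a finite prefix, a runtime monitor of this shape can block the offending effect before it is executed, provided every effect actually passes through the monitor.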

CR Jul 8, 2025
AI Agent Smart Contract Exploit Generation

Arthur Gervais, Liyi Zhou

Smart contract vulnerabilities have led to billions in losses, yet finding actionable exploits remains challenging. Traditional fuzzers rely on rigid heuristics and struggle with complex attacks, while human auditors are thorough but slow and don't scale. Large Language Models offer a promising middle ground, combining human-like reasoning with machine speed. However, early studies show that simply prompting LLMs generates unverified vulnerability speculations with high false positive rates. To address this, we present A1, an agentic system that transforms any LLM into an end-to-end exploit generator. A1 provides agents with six domain-specific tools for autonomous vulnerability discovery, from understanding contract behavior to testing strategies on real blockchain states. All outputs are concretely validated through execution, ensuring only profitable proof-of-concept exploits are reported. We evaluate A1 across 36 real-world vulnerable contracts on Ethereum and Binance Smart Chain. A1 achieves a 63% success rate on the VERITE benchmark. Across all successful cases, A1 extracts up to $8.59 million per exploit and $9.33 million total. Through 432 experiments across six LLMs, we show that most exploits emerge within five iterations, with costs ranging $0.01-$3.59 per attempt. Using Monte Carlo analysis of historical attacks, we demonstrate that immediate vulnerability detection yields 86-89% success probability, dropping to 6-21% with week-long delays. Our economic analysis reveals a troubling asymmetry: attackers achieve profitability at $6,000 exploit values while defenders require $60,000 -- raising fundamental questions about whether AI agents inevitably favor exploitation over defense.

CR Mar 10, 2025
AuthorMist: Evading AI Text Detectors with Reinforcement Learning

Isaac David, Arthur Gervais

In the age of powerful AI-generated text, automatic detectors have emerged to identify machine-written content. This poses a threat to author privacy and freedom, as text authored with AI assistance may be unfairly flagged. We propose AuthorMist, a novel reinforcement learning-based system to transform AI-generated text into human-like writing. AuthorMist leverages a 3-billion-parameter language model as a backbone, fine-tuned with Group Relative Policy Optimization (GRPO) to paraphrase text in a way that evades AI detectors. Our framework establishes a generic approach where external detector APIs (GPTZero, WinstonAI, Originality.ai, etc.) serve as reward functions within the reinforcement learning loop, enabling the model to systematically learn outputs that these detectors are less likely to classify as AI-generated. This API-as-reward methodology can be applied broadly to optimize text against any detector with an accessible interface. Experiments on multiple datasets and detectors demonstrate that AuthorMist effectively reduces the detectability of AI-generated text while preserving the original meaning. Our evaluation shows attack success rates ranging from 78.6% to 96.2% against individual detectors, significantly outperforming baseline paraphrasing methods. AuthorMist maintains high semantic similarity (above 0.94) with the original text while successfully evading detection. These results highlight limitations in current AI text detection technologies and raise questions about the sustainability of the detection-evasion arms race.
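The group-relative step at the heart of Group Relative Policy Optimization can be illustrated in a few lines: several candidate outputs for the same input are scored, and each reward is normalized against its own group. The sketch below assumes the reward is one minus a detector's "AI-generated" probability, and uses the population standard deviation. Both choices, along with the function name and the sample scores, are illustrative assumptions rather than details confirmed by the abstract.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward by the mean and standard deviation of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps guards against division by zero when all rewards are equal.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical detector scores for four paraphrases of one input
# (probability that the text is AI-generated).
detector_scores = [0.9, 0.6, 0.3, 0.2]
rewards = [1.0 - s for s in detector_scores]
advantages = group_relative_advantages(rewards)
```

Paraphrases that the detector flags less than their siblings receive positive advantages, so the policy update favors them without needing a separately trained value model.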

CR Feb 1
TxRay: Agentic Postmortem of Live Blockchain Attacks

Ziyue Wang, Jiangshan Yu, Kaihua Qin et al.

Decentralized Finance (DeFi) has turned blockchains into financial infrastructure, allowing anyone to trade, lend, and build protocols without intermediaries, but this openness exposes pools of value controlled by code. Within five years, the DeFi ecosystem has lost over 15.75B USD to reported exploits. Many exploits arise from permissionless opportunities that any participant can trigger using only public state and standard interfaces, which we call Anyone-Can-Take (ACT) opportunities. Despite on-chain transparency, postmortem analysis remains slow and manual: investigations start from limited evidence, sometimes only a single transaction hash, and must reconstruct the exploit lifecycle by recovering related transactions, contract code, and state dependencies. We present TxRay, a Large Language Model (LLM) agentic postmortem system that uses tool calls to reconstruct live ACT attacks from limited evidence. Starting from one or more seed transactions, TxRay recovers the exploit lifecycle, derives an evidence-backed root cause, and generates a runnable, self-contained Proof of Concept (PoC) that deterministically reproduces the incident. TxRay self-checks postmortems by encoding incident-specific semantic oracles as executable assertions. To evaluate PoC correctness and quality, we develop PoCEvaluator, an independent agentic execution-and-review evaluator. On 114 incidents from DeFiHackLabs, TxRay produces an expert-aligned root cause and an executable PoC for 105 incidents, achieving 92.11% end-to-end reproduction. Under PoCEvaluator, 98.1% of TxRay PoCs avoid hard-coding attacker addresses, a +24.8pp lift over DeFiHackLabs. In a live deployment, TxRay delivers validated root causes in 40 minutes and PoCs in 59 minutes at median latency. TxRay's oracle-validated PoCs enable attack imitation, improving coverage by 15.6% and 65.5% over STING and APE.

CR Jan 22, 2022
On How Zero-Knowledge Proof Blockchain Mixers Improve, and Worsen User Privacy

Zhipeng Wang, Stefanos Chaliasos, Kaihua Qin et al.

Zero-knowledge proof (ZKP) mixers are one of the most widely-used blockchain privacy solutions, operating on top of smart contract-enabled blockchains. We find that ZKP mixers are tightly intertwined with the growing number of Decentralized Finance (DeFi) attacks and Blockchain Extractable Value (BEV) extractions. Through coin flow tracing, we discover that 205 blockchain attackers and 2,595 BEV extractors leverage mixers as their source of funds, while depositing a total attack revenue of 412.87M USD. Moreover, the US OFAC sanctions against the largest ZKP mixer, Tornado.Cash, have reduced the mixer's daily deposits by more than 80%. Further, ZKP mixers advertise their level of privacy through a so-called anonymity set size, which similarly to k-anonymity allows a user to hide among a set of k other users. Through empirical measurements, we, however, find that these anonymity set claims are mostly inaccurate. For the most popular mixers on Ethereum (ETH) and Binance Smart Chain (BSC), we show how to reduce the anonymity set size on average by 27.34% and 46.02% respectively. Our empirical evidence is also the first to suggest a differing privacy-predilection of users on ETH and BSC. State-of-the-art ZKP mixers are moreover interwoven with the DeFi ecosystem by offering anonymity mining (AM) incentives, i.e., users receive monetary rewards for mixing coins. However, contrary to the claims of related work, we find that AM does not necessarily improve the quality of a mixer's anonymity set. Our findings indicate that AM attracts privacy-ignorant users, who then do not contribute to improving the privacy of other mixer users.

CR Dec 13, 2021
Proof of Steak

Jon Crowcroft, Hamed Haddadi, Arthur Gervais et al.

We introduce Proof-of-Steak (PoS) as a fundamental net-zero block generation technique, often accompanied by Non-Frangipane Tokens. Genesis cut is gradually heated and minted (using the appropriate sauce), enabling the miners to redirect the extracted gold and the dissipated heat into the furnace, hence enabling the first fully-circular economy ever built using blockchain technology, utilising tamper-evident steak haché. In this paper we present the basic ingredients for building Proof-of-Steak, assessing its global impact, and opportunities to save the world and beyond!

CR Sep 23, 2021
Towards Private On-Chain Algorithmic Trading

Ceren Kocaoğullar, Arthur Gervais, Benjamin Livshits

While quantitative automation related to trading crypto-assets such as ERC-20 tokens has become relatively commonplace, with services such as 3Commas and Shrimpy offering user-friendly web-driven services for even the average crypto trader, we have not yet seen the emergence of on-chain trading as a phenomenon. We hypothesize that, just like decentralized exchanges (DEXes) that by now are by some measures more popular than traditional exchanges, progress in the space of decentralized finance (DeFi) may enable attractive online trading automation options. In this paper we present ChainBot, an approach for creating algorithmic trading bots with the help of blockchain technology. We show how to partition the computation into on- and off-chain components in a way that provides a measure of end-to-end integrity, while preserving the algorithmic "secret sauce". Our system is enabled with a careful use of algorithm partitioning, zero-knowledge proofs and smart contracts. We also show that with layer-2 (L2) technologies, trades can be kept private, which means that algorithmic parameters are difficult to recover by a chain observer. Our approach offers more transparent access to liquidity and better censorship-resistance compared to traditional off-chain trading approaches. We develop a sample ChainBot and train it on historical data, resulting in returns that are up to 2.4x the buy-and-hold strategy, which we use as our baseline. Our measurements show that across 1000 runs, the end-to-end average execution time for our system is 48.4 seconds. We demonstrate that the frequency of trading does not significantly affect the rate of return and Sharpe ratio, which indicates that we do not have to trade at every block, thereby significantly saving in terms of gas fees. In our implementation, a user who invests $1,000 would earn $105 and spend $3 on gas, assuming a user pool of 1,000 subscribers.

GN Jun 15, 2021
CeFi vs. DeFi -- Comparing Centralized to Decentralized Finance

Kaihua Qin, Liyi Zhou, Yaroslav Afonin et al.

To non-experts, the traditional Centralized Finance (CeFi) ecosystem may seem obscure, because users are typically not aware of the underlying rules or agreements of financial assets and products. Decentralized Finance (DeFi), however, is making its debut as an ecosystem claiming to offer transparency and control, which are partially attributable to the underlying integrity-protected blockchain, as well as currently higher financial asset yields than CeFi. Yet, the boundaries between CeFi and DeFi may not always be so clear-cut. In this work, we systematically analyze the differences between CeFi and DeFi, covering legal, economic, security, privacy and market manipulation. We provide a structured methodology to differentiate between a CeFi and a DeFi service. Our findings show that certain DeFi assets (such as USDC or USDT stablecoins) do not necessarily classify as DeFi assets, and may endanger the economic security of intertwined DeFi protocols. We conclude this work with the exploration of possible synergies between CeFi and DeFi.

CR Jun 14, 2021
A2MM: Mitigating Frontrunning, Transaction Reordering and Consensus Instability in Decentralized Exchanges

Liyi Zhou, Kaihua Qin, Arthur Gervais

The asset trading volume on blockchain-based exchanges (DEX) has increased substantially since the advent of Automated Market Makers (AMM). Yet, AMMs and their forks compete on the same blockchain, incurring unnecessary network and block-space overhead by attracting sandwich attackers and arbitrage competitions. Moreover, conceptually speaking, a blockchain is one database, and we find little reason to partition this database into multiple competing exchanges, which then necessarily require price synchronization through arbitrage. This paper shows that DEX arbitrage and trade routing among similar AMMs can be performed efficiently and atomically on-chain within smart contracts. These insights lead us to create a new AMM design, an Automated Arbitrage Market Maker, short A2MM DEX. A2MM aims to unite multiple AMMs to reduce overheads and costs and to increase blockchain security. With respect to Miner Extractable Value (MEV), A2MM serves as a decentralized design for users to atomically collect MEV, mitigating the dangers of centralized MEV relay services. We show that A2MM offers essential security benefits. First, A2MM strengthens the blockchain consensus security by mitigating the competitive exploitation of MEV, therefore reducing the risks of consensus forks. A2MM reduces the network layer overhead of competitive transactions and improves network propagation, leading to fewer stale blocks and better blockchain security. Through trade routing, A2MM reduces the predatory risks of sandwich attacks by taking advantage of the minimum profitable victim input. A2MM also offers financial benefits to traders. Failed swap transactions from competitive trading occupy valuable block space, implying an upward pressure on transaction fees. Our evaluation shows that A2MM frees up 32.8% block-space of AMM-related transactions. In expectation, A2MM's revenue allows swap fees to be reduced by 90%.

GNJun 11, 2021
An Empirical Study of DeFi Liquidations: Incentives, Risks, and Instabilities

Kaihua Qin, Liyi Zhou, Pablo Gamito et al.

Financial speculators often seek to increase their potential gains with leverage. Debt is a popular form of leverage, and with over 39.88B USD of total value locked (TVL), the Decentralized Finance (DeFi) lending markets are thriving. Debts, however, entail the risk of liquidation, the process of selling the debt collateral at a discount to liquidators. Nevertheless, few quantitative insights are known about the existing liquidation mechanisms. In this paper, to the best of our knowledge, we are the first to study the breadth of the borrowing and lending markets of the Ethereum DeFi ecosystem. We focus on Aave, Compound, MakerDAO, and dYdX, which collectively represent over 85% of the lending market on Ethereum. Given extensive liquidation data measurements and insights, we systematize the prevalent liquidation mechanisms and are the first to provide a methodology to compare them objectively. We find that the existing liquidation designs incentivize liquidators well but sell excessive amounts of discounted collateral at the borrowers' expense. We measure various risks that liquidation participants are exposed to and quantify the instabilities of existing lending protocols. Moreover, we propose an optimal strategy that allows liquidators to increase their liquidation profit, which may aggravate the loss of borrowers.
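To make "selling the debt collateral at a discount" concrete, the following is a minimal sketch of one common liquidation design, a fixed-spread liquidation. It is an illustration only and is not taken from the paper: the function names and the parameter values (`liquidation_threshold`, `close_factor`, `spread`) are hypothetical, and real protocols differ in their details.

```python
def health_factor(collateral_usd, debt_usd, liquidation_threshold):
    """Risk-adjusted collateral value divided by debt; below 1 means liquidatable."""
    return collateral_usd * liquidation_threshold / debt_usd


def fixed_spread_liquidation(collateral_usd, debt_usd,
                             liquidation_threshold=0.8,  # hypothetical value
                             close_factor=0.5,           # hypothetical value
                             spread=0.05):               # hypothetical value
    """Return the outcome of one liquidation call, or None if the position is healthy.

    The liquidator repays up to `close_factor` of the debt and receives
    collateral worth the repaid amount plus a `spread` bonus.
    """
    if health_factor(collateral_usd, debt_usd, liquidation_threshold) >= 1:
        return None
    repay = debt_usd * close_factor
    # The liquidator cannot seize more collateral than the position holds.
    seized = min(repay * (1 + spread), collateral_usd)
    return {
        "repay_usd": repay,
        "seized_collateral_usd": seized,
        "liquidator_profit_usd": seized - repay,
        "remaining_collateral_usd": collateral_usd - seized,
        "remaining_debt_usd": debt_usd - repay,
    }


if __name__ == "__main__":
    print(fixed_spread_liquidation(collateral_usd=1200.0, debt_usd=1000.0))
```

In this sketch the liquidator's profit is paid out of the borrower's collateral, which is the trade-off between liquidator incentives and borrower losses that the abstract describes.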

CRMar 3, 2021
On the Just-In-Time Discovery of Profit-Generating Transactions in DeFi Protocols

Liyi Zhou, Kaihua Qin, Antoine Cully et al.

In this paper, we investigate two methods that allow us to automatically create profitable DeFi trades, one well-suited to arbitrage and the other applicable to more complicated settings. We first adopt the Bellman-Ford-Moore algorithm with DEFIPOSER-ARB and then create logical DeFi protocol models for a theorem prover in DEFIPOSER-SMT. While DEFIPOSER-ARB focuses on DeFi transactions that form a cycle and performs very well for arbitrage, DEFIPOSER-SMT can detect more complicated profitable transactions. We estimate that DEFIPOSER-ARB and DEFIPOSER-SMT can generate an average weekly revenue of 191.48 ETH (76,592 USD) and 72.44 ETH (28,976 USD) respectively, with the highest transaction revenue being 81.31 ETH (32,524 USD) and 22.40 ETH (8,960 USD) respectively. We further show that DEFIPOSER-SMT finds the known economic bZx attack from February 2020, which yields 0.48M USD. Our forensic investigations show that this opportunity existed for 69 days and could have yielded more revenue if exploited one day earlier. Our evaluation spans 150 days, given 96 DeFi protocol actions, and 25 assets. Looking beyond the financial gains mentioned above, forks deteriorate the blockchain consensus security, as they increase the risks of double-spending and selfish mining. We explore the implications of DEFIPOSER-ARB and DEFIPOSER-SMT on blockchain consensus. Specifically, we show that the trades identified by our tools exceed the Ethereum block reward by up to 874x. Given optimal adversarial strategies provided by a Markov Decision Process (MDP), we quantify the value threshold at which a profitable transaction qualifies as Miner Extractable Value (MEV) and would incentivize MEV-aware miners to fork the blockchain. For instance, we find that on Ethereum, a miner with a hash rate of 10% would fork the blockchain if an MEV opportunity exceeds 4x the block reward.
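The link between Bellman-Ford-Moore and cyclic arbitrage can be illustrated with a textbook sketch: if each exchange rate r becomes an edge of weight -log(r), a cycle whose rates multiply to more than 1 is a negative-weight cycle. The code below is not the DEFIPOSER-ARB implementation; it assumes constant exchange rates (no slippage, fees, or gas costs), and the function name and the example rates are hypothetical.

```python
import math


def find_arbitrage_cycle(rates):
    """rates maps (src, dst) -> units of dst received per unit of src.

    Returns a list of assets [a, b, ..., a] whose rates multiply to > 1,
    or None if no such cycle exists.
    """
    assets = sorted({asset for pair in rates for asset in pair})
    # Product of rates > 1 along a cycle  <=>  sum of -log(rate) < 0.
    edges = [(u, v, -math.log(r)) for (u, v), r in rates.items()]
    # Zero initial distances act like a virtual source linked to every asset.
    dist = {a: 0.0 for a in assets}
    pred = {a: None for a in assets}
    last = None
    for _ in range(len(assets)):
        last = None
        for u, v, w in edges:
            if dist[u] + w < dist[v] - 1e-12:
                dist[v] = dist[u] + w
                pred[v] = u
                last = v
        if last is None:
            return None  # a full pass without relaxation: no negative cycle
    # A relaxation in the final pass implies a negative cycle.
    # Walking back len(assets) steps guarantees we stand on the cycle itself.
    for _ in range(len(assets)):
        last = pred[last]
    cycle = [last]
    node = pred[last]
    while node != last:
        cycle.append(node)
        node = pred[node]
    cycle.append(last)
    cycle.reverse()
    return cycle


if __name__ == "__main__":
    example = {("USD", "EUR"): 0.9, ("EUR", "GBP"): 0.9, ("GBP", "USD"): 1.3}
    print(find_arbitrage_cycle(example))
```

On real AMMs the effective rate depends on the trade size, so a detected cycle is only a candidate whose profitability still has to be checked against pool reserves and transaction fees.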

CRJan 15, 2021
The Eye of Horus: Spotting and Analyzing Attacks on Ethereum Smart Contracts

Christof Ferreira Torres, Antonio Ken Iannillo, Arthur Gervais et al.

In recent years, Ethereum gained tremendously in popularity, growing from a daily transaction average of 10K in January 2016 to an average of 500K in January 2020. Similarly, smart contracts began to carry more value, making them appealing targets for attackers. As a result, they started to become victims of attacks, costing millions of dollars. In response to these attacks, both academia and industry proposed a plethora of tools to scan smart contracts for vulnerabilities before deploying them on the blockchain. However, most of these tools solely focus on detecting vulnerabilities and not attacks, let alone quantifying or tracing the number of stolen assets. In this paper, we present Horus, a framework that empowers the automated detection and investigation of smart contract attacks based on logic-driven and graph-driven analysis of transactions. Horus provides quick means to quantify and trace the flow of stolen assets across the Ethereum blockchain. We perform a large-scale analysis of all the smart contracts deployed on Ethereum until May 2020. We identified 1,888 attacked smart contracts and 8,095 adversarial transactions in the wild. Our investigation shows that the number of attacks did not necessarily decrease over the past few years, but for some vulnerabilities remained constant. Finally, we also demonstrate the practicality of our framework via an in-depth analysis on the recent Uniswap and Lendf.me attacks.

CRJan 14, 2021
Quantifying Blockchain Extractable Value: How dark is the forest?

Kaihua Qin, Liyi Zhou, Arthur Gervais

Permissionless blockchains such as Bitcoin have excelled at financial services. Yet, opportunistic traders extract monetary value from the mesh of decentralized finance (DeFi) smart contracts through so-called blockchain extractable value (BEV). The recent emergence of centralized BEV relayers portrays BEV as a positive additional revenue source. Because BEV was quantitatively shown to deteriorate the blockchain's consensus security, BEV relayers endanger the ledger security by incentivizing rational miners to fork the chain. For example, a rational miner with a 10% hashrate will fork Ethereum if a BEV opportunity exceeds 4x the block reward. However, related work is currently missing quantitative insights on past BEV extraction to assess the practical risks of BEV objectively. In this work, we quantify the BEV danger by deriving the USD extracted from sandwich attacks, liquidations, and decentralized exchange arbitrage. We estimate that over 32 months, BEV yielded 540.54M USD in profit, divided among 11,289 addresses when capturing 49,691 cryptocurrencies and 60,830 on-chain markets. The highest BEV instance we find amounts to 4.1M USD, 616.6x the Ethereum block reward. Moreover, while the practitioners' community has discussed the existence of generalized trading bots, we are, to our knowledge, the first to provide a concrete algorithm. Our algorithm can replace unconfirmed transactions without the need to understand the victim transactions' underlying logic, which we estimate to have yielded a profit of 57,037.32 ETH (35.37M USD) over 32 months of past blockchain data. Finally, we formalize and analyze emerging BEV relay systems, where miners accept BEV transactions from a centralized relay server instead of the peer-to-peer (P2P) network. We find that such relay systems aggravate the consensus layer attacks and therefore further endanger blockchain security.

CROct 2, 2020
AMR: Autonomous Coin Mixer with Privacy Preserving Reward Distribution

Duc V. Le, Arthur Gervais

It is well known that users on open blockchains are tracked by an industry providing services to governments, law enforcement, secret services, and the like. While most blockchains do not protect their users' privacy and allow external observers to link transactions and addresses, a growing research interest attempts to design add-on privacy solutions to help users regain their privacy on non-private blockchains. In this work, we propose, to our knowledge, the first censorship-resilient mixer, which can reward its users in a privacy-preserving manner for participating in the system. Increasing the anonymity set size and the diversity of users is, we believe, an important endeavor to raise a mixer's contributed privacy in practice. The paid-out rewards can take the form of governance tokens to decentralize the voting on system parameters, similar to how popular "DeFi farming" protocols operate. Moreover, by leveraging existing DeFi lending platforms, AMR is the first mixer design that allows participating clients to earn financial interest on their deposited funds. Our system AMR is autonomous as it does not rely on any external server or third party. The evaluation of our AMR implementation shows that the system today supports, on Ethereum, anonymity set sizes beyond thousands of users and a capacity of over 66,000 deposits per day, at constant system costs. We provide a formal specification of our zkSNARK-based AMR system, a privacy and security analysis, implementation, and evaluation with both the MiMC and Poseidon hash functions.

CRSep 29, 2020
High-Frequency Trading on Decentralized On-Chain Exchanges

Liyi Zhou, Kaihua Qin, Christof Ferreira Torres et al.

Decentralized exchanges (DEXs) allow parties to participate in financial markets while retaining full custody of their funds. However, the transparency of blockchain-based DEXs, in combination with the latency for transactions to be processed, makes market manipulation feasible. For instance, adversaries could perform front-running -- the practice of exploiting (typically non-public) information that may change the price of an asset for financial gain. In this work we formalize, analytically exposit and empirically evaluate an augmented variant of front-running: sandwich attacks, which involve front- and back-running victim transactions on a blockchain-based DEX. We quantify the probability of an adversarial trader being able to undertake the attack, based on the relative positioning of a transaction within a blockchain block. We find that a single adversarial trader can earn a daily revenue of several thousand USD when performing sandwich attacks on one particular DEX -- Uniswap, an exchange with over 5M USD daily trading volume by June 2020. In addition to a single-adversary game, we simulate the outcome of sandwich attacks under multiple competing adversaries, to account for the real-world trading environment.
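The mechanics of a sandwich attack can be sketched on a constant-product AMM. The example below is an illustration under stated assumptions rather than the paper's model: it assumes a pool that keeps x * y constant with a 0.3% fee charged on the input amount, ignores gas costs and the victim's slippage limit, and uses hypothetical function names and pool sizes.

```python
def swap(reserve_in, reserve_out, amount_in, fee=0.003):
    """Constant-product swap; returns (new_reserve_in, new_reserve_out, amount_out)."""
    effective_in = amount_in * (1 - fee)  # the fee stays in the pool
    amount_out = reserve_out * effective_in / (reserve_in + effective_in)
    return reserve_in + amount_in, reserve_out - amount_out, amount_out


def sandwich(x_reserve, y_reserve, victim_in, attacker_in, fee=0.003):
    """Simulate front-run, victim trade, and back-run on an X/Y pool."""
    # 1. Front-run: the attacker buys Y with X, pushing the price of Y up.
    x1, y1, attacker_y = swap(x_reserve, y_reserve, attacker_in, fee)
    # 2. The victim's X -> Y trade executes at the worsened price.
    x2, y2, victim_y = swap(x1, y1, victim_in, fee)
    # 3. Back-run: the attacker sells the Y bought in step 1 back for X.
    _, _, attacker_x_back = swap(y2, x2, attacker_y, fee)
    # Baseline: what the victim would have received without the attack.
    _, _, victim_y_alone = swap(x_reserve, y_reserve, victim_in, fee)
    return {
        "attacker_profit_x": attacker_x_back - attacker_in,
        "victim_loss_y": victim_y_alone - victim_y,
    }


if __name__ == "__main__":
    print(sandwich(x_reserve=1000.0, y_reserve=1000.0,
                   victim_in=100.0, attacker_in=50.0))
```

With no victim trade between the two attacker transactions, the same round trip loses money to fees, which is why the attack depends on the attacker's transactions being positioned directly around the victim's.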

CRAug 26, 2020
FileBounty: Fair Data Exchange

Simon Janin, Kaihua Qin, Akaki Mamageishvili et al.

Digital contents are typically sold online through centralized and custodian marketplaces, which requires the trading partners to trust a central entity. We present FileBounty, a fair protocol which, assuming the cryptographic hash of the file of interest is known to the buyer, is trust-free and lets a buyer purchase data for a previously agreed monetary amount, while guaranteeing the integrity of the contents. To prevent misbehavior, FileBounty guarantees that any deviation from the expected participants' behavior results in a negative financial payoff; i.e. we show that honest behavior corresponds to a subgame perfect Nash equilibrium. Our novel deposit refunding scheme is resistant to extortion attacks under rational adversaries. If buyer and seller behave honestly, FileBounty's execution requires only three on-chain transactions, while the actual data is exchanged off-chain in an efficient and privacy-preserving manner. We moreover show how FileBounty enables a flexible peer-to-peer setting where multiple parties fairly sell a file to a buyer.

CRAug 26, 2020
Applying Private Information Retrieval to Lightweight Bitcoin Clients

Kaihua Qin, Henryk Hadass, Arthur Gervais et al.

Lightweight Bitcoin clients execute a Simple Payment Verification (SPV) protocol to verify the validity of transactions related to a particular user. Currently, lightweight clients use Bloom filters to significantly reduce the amount of bandwidth required to validate a particular transaction. This is despite the fact that research has shown that Bloom filters are insufficient at preserving the privacy of clients' queries. In this paper we describe our design of an SPV protocol that leverages Private Information Retrieval (PIR) to create fully private and performant queries. We show that our protocol has a low bandwidth and latency cost; properties that make our protocol a viable alternative for lightweight Bitcoin clients and other cryptocurrencies with a similar SPV model. In contrast to Bloom filters, our PIR-based approach offers deterministic privacy to the user. Among our results, we show that in the worst case, clients who would like to verify 100 transactions occurring in the past week incur a bandwidth cost of 33.54 MB with an associated latency of approximately 4.8 minutes, when using our protocol. The same query executed using the Bloom-filter-based SPV protocol incurs a bandwidth cost of 12.85 MB; this is a modest overhead considering the privacy guarantees it provides.

CRMay 25, 2020
ConFuzzius: A Data Dependency-Aware Hybrid Fuzzer for Smart Contracts

Christof Ferreira Torres, Antonio Ken Iannillo, Arthur Gervais et al.

Smart contracts are Turing-complete programs that are executed across a blockchain. Unlike traditional programs, once deployed, they cannot be modified. As smart contracts carry more value, they become more of an exciting target for attackers. Over the last years, they suffered from exploits costing millions of dollars due to simple programming mistakes. As a result, a variety of tools for detecting bugs have been proposed. Most of these tools rely on symbolic execution, which may yield false positives due to over-approximation. Recently, many fuzzers have been proposed to detect bugs in smart contracts. However, these tend to be more effective in finding shallow bugs and less effective in finding bugs that lie deep in the execution, therefore achieving low code coverage and many false negatives. An alternative that has proven to achieve good results in traditional programs is hybrid fuzzing, a combination of symbolic execution and fuzzing. In this work, we study hybrid fuzzing on smart contracts and present ConFuzzius, the first hybrid fuzzer for smart contracts. ConFuzzius uses evolutionary fuzzing to exercise shallow parts of a smart contract and constraint solving to generate inputs that satisfy complex conditions that prevent evolutionary fuzzing from exploring deeper parts. Moreover, ConFuzzius leverages dynamic data dependency analysis to efficiently generate sequences of transactions that are more likely to result in contract states in which bugs may be hidden. We evaluate the effectiveness of ConFuzzius by comparing it with state-of-the-art symbolic execution tools and fuzzers for smart contracts. Our evaluation on a curated dataset of 128 contracts and 21K real-world contracts shows that our hybrid approach detects more bugs (up to 23%) while outperforming state-of-the-art in terms of code coverage (up to 69%), and that data dependency analysis boosts bug detection up to 18%.

CRMar 8, 2020
Attacking the DeFi Ecosystem with Flash Loans for Fun and Profit

Kaihua Qin, Liyi Zhou, Benjamin Livshits et al.

Credit allows a lender to loan out surplus capital to a borrower. In the traditional economy, credit bears the risk that the borrower may default on its debt; the lender hence requires upfront collateral from the borrower, plus interest fee payments. Due to the atomicity of blockchain transactions, lenders can offer flash loans, i.e., loans that are only valid within one transaction and must be repaid by the end of that transaction. This concept has led to a number of interesting attack possibilities, some of which were exploited in February 2020. This paper is the first to explore the implication of transaction atomicity and flash loans for the nascent decentralized finance (DeFi) ecosystem. We show quantitatively how transaction atomicity increases the arbitrage revenue. We moreover analyze two existing attacks with ROIs beyond 500k%. We formulate finding the attack parameters as an optimization problem over the state of the underlying Ethereum blockchain and the state of the DeFi ecosystem. We show how malicious adversaries can efficiently maximize an attack profit and hence damage the DeFi ecosystem further. Specifically, we present how two previously executed attacks can be "boosted" to result in a profit of 829.5k USD and 1.1M USD, respectively, which is a boost of 2.37x and 1.73x, respectively.

CRFeb 19, 2020
The Decentralized Financial Crisis

Lewis Gudgeon, Daniel Perez, Dominik Harz et al.

The Global Financial Crisis of 2008, caused by the accumulation of excessive financial risk, inspired Satoshi Nakamoto to create Bitcoin. Now, more than ten years later, Decentralized Finance (DeFi), a peer-to-peer financial paradigm which leverages blockchain-based smart contracts to ensure its integrity and security, contains over 702m USD of capital as of April 15th, 2020. As this ecosystem develops, it is at risk of the very sort of financial meltdown it is supposed to be preventing. In this paper we explore how design weaknesses and price fluctuations in DeFi protocols could lead to a DeFi crisis. We focus on DeFi lending protocols as they currently constitute most of the DeFi ecosystem with a 76% market share by capital as of April 15th, 2020. First, we demonstrate the feasibility of attacking Maker's governance design to take full control of the protocol, the largest DeFi protocol by market share, which would have allowed the theft of 0.5bn USD of collateral and the minting of an unlimited supply of DAI tokens. In doing so, we present a novel strategy utilizing so-called flash loans that would have in principle allowed the execution of the governance attack in just two transactions and without the need to lock any assets. Approximately two weeks after we disclosed the attack details, Maker modified the governance parameters, mitigating the attack vectors. Second, we turn to a central component of financial risk in DeFi lending protocols. Inspired by stress-testing as performed by central banks, we develop a stress-testing framework for a stylized DeFi lending protocol, focusing our attention on the impact of a drying-up of liquidity on protocol solvency. Based on our parameters, we find that with sufficient illiquidity a lending protocol with a total debt of 400m USD could become undercollateralized within 19 days.

CRJun 4, 2018
Securify: Practical Security Analysis of Smart Contracts

Petar Tsankov, Andrei Dan, Dana Drachsler Cohen et al.

Permissionless blockchains allow the execution of arbitrary programs (called smart contracts), enabling mutually untrusted entities to interact without relying on trusted third parties. Despite their potential, repeated security concerns have shaken the trust in handling billions of USD by smart contracts. To address this problem, we present Securify, a security analyzer for Ethereum smart contracts that is scalable, fully automated, and able to prove contract behaviors as safe/unsafe with respect to a given property. Securify's analysis consists of two steps. First, it symbolically analyzes the contract's dependency graph to extract precise semantic information from the code. Then, it checks compliance and violation patterns that capture sufficient conditions for proving if a property holds or not. To enable extensibility, all patterns are specified in a designated domain-specific language. Securify is publicly released, has analyzed >18K contracts submitted by its users, and is regularly used to conduct security audits by experts. We present an extensive evaluation of Securify over real-world Ethereum smart contracts and demonstrate that it can effectively prove the correctness of smart contracts and discover critical violations.