EVM-QuestBench: An Execution-Grounded Benchmark for Natural-Language Transaction Code Generation
EVM-QuestBench addresses the need for execution-grounded benchmarks in blockchain transaction scenarios, where errors can cause irreversible user losses. The contribution is incremental: it builds on existing evaluation methods by adding execution-accuracy and safety checks.
The paper tackles the problem of evaluating large language models that generate on-chain transaction scripts. It introduces EVM-QuestBench, a benchmark that combines dynamic evaluation with script execution on a forked EVM chain. Across 20 models, it finds large performance gaps and a persistent asymmetry between single-action precision and multi-step workflow completion.
Large language models are increasingly applied to various development scenarios. However, in on-chain transaction scenarios, even a minor error can cause irreversible loss for users. Existing evaluations often overlook execution accuracy and safety. We introduce EVM-QuestBench, an execution-grounded benchmark for natural-language transaction-script generation on EVM-compatible chains. The benchmark employs dynamic evaluation: instructions are sampled from template pools, numeric parameters are drawn from predefined intervals, and validators verify outcomes against these instantiated values. EVM-QuestBench contains 107 tasks (62 atomic, 45 composite). Its modular architecture enables rapid task development. The runner executes scripts on a forked EVM chain with snapshot isolation; composite tasks apply step-efficiency decay. We evaluate 20 models and find large performance gaps, with split scores revealing persistent asymmetry between single-action precision and multi-step workflow completion. Code: https://anonymous.4open.science/r/bsc_quest_bench-A9CF/.
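The dynamic-evaluation loop described above (sample an instruction template, draw a numeric parameter from a predefined interval, validate the outcome against the instantiated value) can be illustrated with a minimal Python sketch. It is not taken from the benchmark's code: the template strings, the amount interval, the recipient address, and the names `TaskInstance`, `instantiate`, and `validate` are all hypothetical, and the balances passed to the validator stand in for values that a real runner would read from the forked chain before and after executing the generated script.

```python
import random
from dataclasses import dataclass

# Hypothetical template pool and parameter interval for one atomic task.
TEMPLATES = [
    "Send {amount} BNB to {recipient}.",
    "Transfer {amount} BNB to the address {recipient}.",
]
# Amounts are drawn in units of 0.0001 BNB to keep wei arithmetic exact.
AMOUNT_UNITS_RANGE = (100, 10_000)  # 0.01 to 1.0 BNB (assumed interval)
WEI_PER_UNIT = 10**14               # 0.0001 BNB expressed in wei
RECIPIENT = "0x000000000000000000000000000000000000dEaD"  # placeholder


@dataclass(frozen=True)
class TaskInstance:
    instruction: str  # natural-language prompt shown to the model
    amount_wei: int   # instantiated value the validator checks against


def instantiate(rng: random.Random) -> TaskInstance:
    """Sample a template and a numeric parameter to build one task instance."""
    template = rng.choice(TEMPLATES)
    units = rng.randint(*AMOUNT_UNITS_RANGE)
    amount_text = f"{units / 10_000:.4f}"
    instruction = template.format(amount=amount_text, recipient=RECIPIENT)
    return TaskInstance(instruction, units * WEI_PER_UNIT)


def validate(task: TaskInstance, before_wei: int, after_wei: int) -> bool:
    """Pass only if the recipient's balance rose by exactly the sampled amount."""
    return after_wei - before_wei == task.amount_wei


if __name__ == "__main__":
    task = instantiate(random.Random(7))
    print(task.instruction)
    # A runner would obtain these balances from the forked chain.
    print(validate(task, 0, task.amount_wei))
```

Because the expected value is fixed only at instantiation time, a script that hard-codes an amount memorized from an earlier run would fail this check, which is the motivation for sampling parameters rather than using static prompts.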