SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors
This work examines whether LLMs can act as efficient surrogates for computational processes, which could benefit developers and researchers in data mining and code analysis. Its contribution is incremental, though it explores an underexplored application area.
The authors investigate whether large language models (LLMs) can serve as surrogate models for code execution prediction. They introduce the SURGE benchmark, with 1,160 problems across 8 aspects, and evaluate 21 LLMs to assess feasibility, scaling laws, and predictive accuracy.
Neural surrogate models are powerful and efficient tools in data mining. Meanwhile, large language models (LLMs) have demonstrated remarkable capabilities in code-related tasks such as generation and understanding. However, an equally important yet underexplored question is whether LLMs can serve as surrogate models for code execution prediction. To investigate this question systematically, we introduce SURGE, a comprehensive benchmark with 1,160 problems covering 8 key aspects: multi-language programming tasks, competition-level programming problems, repository-level code analysis, high-cost scientific computing, time-complexity-intensive algorithms, buggy code analysis, programs dependent on specific compilers or execution environments, and formal mathematical proof verification. Through extensive analysis of 21 open-source and proprietary LLMs, we examine scaling laws, data efficiency, and predictive accuracy. Our findings offer insights into the feasibility of LLMs as efficient surrogates for computational processes. The benchmark and evaluation framework are available at https://github.com/Imbernoulli/SURGE.
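To make the notion of code execution prediction concrete, the following is a minimal sketch of how a surrogate executor could be scored: each program is actually run to obtain a reference output, and the model's predicted output is compared against it. This is an illustration only, not the SURGE evaluation framework. The exact-match metric, the toy problems, and the names `run_ground_truth`, `exact_match_accuracy`, and `toy_predictor` are all assumptions introduced here, and `toy_predictor` is a stand-in for a real LLM call.

```python
import contextlib
import io


def run_ground_truth(source: str) -> str:
    """Execute a snippet and capture what it prints (the reference output)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, {})  # acceptable only for trusted toy snippets
    return buffer.getvalue().strip()


def exact_match_accuracy(problems, predict) -> float:
    """Fraction of problems whose predicted output equals the real output.

    Exact match is an assumed metric for this sketch.
    """
    if not problems:
        return 0.0
    hits = sum(predict(src).strip() == run_ground_truth(src) for src in problems)
    return hits / len(problems)


# Hypothetical toy problems, not taken from the benchmark.
TOY_PROBLEMS = [
    "print(sum(range(5)))",
    "print('ab' * 2)",
    "print(len([x for x in range(10) if x % 3 == 0]))",
]


def toy_predictor(source: str) -> str:
    """Stand-in for an LLM: answers two problems correctly, guesses otherwise."""
    canned = {TOY_PROBLEMS[0]: "10", TOY_PROBLEMS[1]: "abab"}
    return canned.get(source, "0")


accuracy = exact_match_accuracy(TOY_PROBLEMS, toy_predictor)
```

In a real setting, `toy_predictor` would be replaced by a function that prompts a model with the source code and returns its predicted output. Exact string comparison is the simplest possible check; tasks such as scientific computing would likely need a tolerance-based comparison instead.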