SE CL Mar 31, 2024

CodeBenchGen: Creating Scalable Execution-based Code Generation Benchmarks

Yiqing Xie, Alex Xie, Divyanshu Sheth, Pengfei Liu, Daniel Fried, Carolyn Rose

arXiv:2404.00566v4

Originality: Incremental advance

AI Analysis

This work addresses the need for better benchmarks in code generation research, though the advance is incremental because it builds on existing datasets and methods.

The authors tackle the problem of evaluating code generation systems by creating scalable execution-based benchmarks from naturally occurring code. The resulting dataset, Exec-CSN, contains 1,931 examples; in a human study, 81.3% of the examples were solvable by humans and 61% were rated as "requires effort to solve".

To adequately test modern code generation systems, evaluation benchmarks must execute and test the code generated by the system. However, these execution and testing requirements have largely limited benchmarks to settings where code is easily executable or has human-written tests. To facilitate evaluation of code generation systems across diverse scenarios, we present CodeBenchGen, a framework to create scalable execution-based benchmarks from naturally occurring code sources. Specifically, we leverage a large language model (LLM) to sandbox arbitrary pieces of code into evaluation examples, including test cases for execution-based evaluation. We illustrate the usefulness of our framework by creating a dataset, Exec-CSN, which includes 1,931 examples involving 293 libraries converted from code in 367 GitHub repositories taken from the CodeSearchNet dataset. To demonstrate the solvability of examples in Exec-CSN, we present a human study demonstrating that 81.3% of the examples can be solved by humans and 61% are rated as "requires effort to solve". We conduct code generation experiments on open-source and proprietary models and analyze the performance of both humans and models. We provide code and data at: https://github.com/yiqingxyq/CodeBenchGen.
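
To make "execution-based evaluation" concrete, the sketch below shows one generic way to check a generated solution against test cases: the candidate code and its tests are written to a temporary script, run in a fresh interpreter, and counted as passing only if the process exits cleanly. This is an illustrative assumption, not the CodeBenchGen implementation; the function names `passes_tests` and `pass_rate` and the toy `add` example are hypothetical, and a plain subprocess provides process separation only, not a security sandbox.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(candidate_src: str, test_src: str, timeout: float = 10.0) -> bool:
    """Run candidate code followed by its tests in a fresh interpreter.

    Hypothetical helper: returns True only if the script exits with
    status 0 before the timeout (a failed assert exits non-zero).
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "example.py"
        script.write_text(candidate_src + "\n\n" + test_src, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            # Treat non-terminating candidates as failures.
            return False
        return result.returncode == 0


def pass_rate(candidates: list[str], test_src: str) -> float:
    """Fraction of candidate solutions that pass the given tests."""
    if not candidates:
        return 0.0
    return sum(passes_tests(c, test_src) for c in candidates) / len(candidates)


# Toy example (made up for illustration): one correct and one buggy candidate.
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"

if __name__ == "__main__":
    print(passes_tests(good, tests))
    print(passes_tests(bad, tests))
    print(pass_rate([good, bad], tests))
```

Running each candidate in its own process keeps one failing or hanging solution from affecting the others; a real benchmark would additionally need dependency management and stronger isolation, which this sketch leaves out.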
