SE AIAug 10, 2025

Dynamic Benchmark Construction for Evaluating Large Language Models on Real-World Codes

Zhe Zhang, Runlin Liu, Aishan Liu, Xingyu Liu, Xiang Gao, Hailong Sun

arXiv:2508.07180v19.82 citationsh-index: 6

Originality Incremental advance

AI Analysis

This addresses the need for contamination-resistant and rigorous benchmarks for LLMs in software development, though it is incremental as it builds on existing benchmarking methods with new innovations.

The authors tackled the problem of evaluating large language models (LLMs) on real-world code generation by developing CODE2BENCH, a pipeline for dynamically constructing benchmarks from GitHub repositories, resulting in a benchmark with 1,163 tasks where models struggled on complex, cross-language tasks but performed better on Python-specific ones.

As large language models LLMs) become increasingly integrated into software development workflows, rigorously evaluating their performance on complex, real-world code generation tasks has become essential. However, existing benchmarks often suffer from data contamination and limited test rigor, constraining their ability to reveal model failures effectively. To address these, we present CODE2BENCH, a end-to-end pipeline for dynamically constructing robust and contamination-resistant benchmarks from real-world GitHub repositories. Specifically, CODE2BENCH introduces three key innovations: (1) Automated Dynamism, achieved through periodic ingestion of recent code to minimize training data contamination; (2) Scope Graph-based dependency analysis, which enables structured classification of functions into benchmark instances with controlled dependency levels (distinguishing between Self-Contained (SC) tasks for cross-language evaluation and Weakly Self-Contained (WSC) tasks involving permitted library usage); and (3) Property-Based Testing (PBT) for the automated synthesis of rigorous test suites to enable thorough functional verification. Using this pipeline, we construct CODE2BENCH-2505, the first benchmark derived from 880 recent Python projects spanning diverse domains, comprising 1,163 code generation tasks with 100% average branch coverage on ground-truth implementations. Extensive evaluation of 16 LLMs using CODE2BENCH-2505 reveals that models consistently struggle with SC tasks requiring complex, non-standard logic and cross-language transfer, while showing relatively stronger performance on WSC tasks in Python. Our work introduces a contamination-resistant, language-agnostic methodology for dynamic benchmark construction, offering a principled foundation for the comprehensive and realistic evaluation of LLMs on real-world software development tasks.

View on arXiv PDF

Similar