CL, LG, SE | May 7, 2024

NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts

Tsinghua
arXiv:2405.04520v1 | 24 citations | h-index: 36 | Has Code
Originality: Incremental advance
AI Analysis

This paper addresses overfitting to simplified benchmarks, a concern for both developers and researchers. The contribution is incremental because it builds on existing benchmarking efforts.

The authors address the mismatch between existing code synthesis benchmarks and real-world coding tasks by introducing NaturalCodeBench (NCB), a benchmark of 402 problems drawn from natural user queries. They find that models with similar HumanEval scores can still differ significantly on NCB, and that even GPT-4 remains far from satisfactory.

Large language models (LLMs) have manifested strong ability to generate codes for productive activities. However, current benchmarks for code synthesis, such as HumanEval, MBPP, and DS-1000, are predominantly oriented towards introductory tasks on algorithm and data science, insufficiently satisfying challenging requirements prevalent in real-world coding. To fill this gap, we propose NaturalCodeBench (NCB), a challenging code benchmark designed to mirror the complexity and variety of scenarios in real coding tasks. NCB comprises 402 high-quality problems in Python and Java, meticulously selected from natural user queries from online coding services, covering 6 different domains. Noting the extraordinary difficulty in creating testing cases for real-world queries, we also introduce a semi-automated pipeline to enhance the efficiency of test case construction. Comparing with manual solutions, it achieves an efficiency increase of more than 4 times. Our systematic experiments on 39 LLMs find that performance gaps on NCB between models with close HumanEval scores could still be significant, indicating a lack of focus on practical code synthesis scenarios or over-specified optimization on HumanEval. On the other hand, even the best-performing GPT-4 is still far from satisfying on NCB. The evaluation toolkit and development set are available at https://github.com/THUDM/NaturalCodeBench.
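To make the abstract's two technical ideas concrete, here is a minimal Python sketch of execution-based scoring and of the kind of comparison behind the reported gaps. It is an illustration under stated assumptions, not the NCB evaluation toolkit: the names `passes_all`, `pass_rate`, and `divergent_pairs`, the test-case shape, and the tolerance parameters are all hypothetical.

```python
from itertools import combinations
from typing import Any, Callable, Dict, Iterable, List, Tuple

# Hypothetical test-case shape: (positional arguments, expected return value).
TestCase = Tuple[tuple, Any]


def passes_all(candidate: Callable[..., Any], tests: Iterable[TestCase]) -> bool:
    """Return True only if the candidate returns the expected value on every test."""
    for args, expected in tests:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            # Assumption: a crash on any test counts as a failed problem.
            return False
    return True


def pass_rate(results: Iterable[bool]) -> float:
    """Fraction of problems solved; 0.0 when there are no results."""
    results = list(results)
    return sum(results) / len(results) if results else 0.0


def divergent_pairs(
    scores_a: Dict[str, float],
    scores_b: Dict[str, float],
    close_tol: float,
    gap_min: float,
) -> List[Tuple[str, str]]:
    """Find model pairs that score within close_tol of each other on benchmark A
    but at least gap_min apart on benchmark B (both thresholds are illustrative)."""
    shared = sorted(set(scores_a) & set(scores_b))
    pairs = []
    for m1, m2 in combinations(shared, 2):
        close_on_a = abs(scores_a[m1] - scores_a[m2]) <= close_tol
        far_on_b = abs(scores_b[m1] - scores_b[m2]) >= gap_min
        if close_on_a and far_on_b:
            pairs.append((m1, m2))
    return pairs
```

In this sketch, `scores_a` would hold per-model HumanEval scores and `scores_b` the corresponding NCB scores. A real harness would also sandbox the generated code and enforce time limits, which the sketch omits.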

Code Implementations: 1 repo