PythonSaga: Redefining the Benchmark to Evaluate Code Generating LLMs
PythonSaga addresses the need of developers and researchers for more reliable evaluation of code-generating LLMs, though the contribution is incremental because it builds on prior benchmarks.
The authors identified biases and limited difficulty in existing benchmarks for code-generating LLMs, and introduced PythonSaga, a new benchmark with 185 prompts across 38 concepts, which revealed poor performance by current models.
Driven by the surge in code generation using large language models (LLMs), numerous benchmarks have emerged to evaluate these LLMs' capabilities. We conducted a large-scale human evaluation of HumanEval and MBPP, two popular benchmarks for Python code generation, analyzing their diversity and difficulty. Our findings reveal a critical bias towards a limited set of programming concepts, with most other concepts neglected entirely. Furthermore, we uncover a worrying prevalence of easy tasks, which potentially inflates estimates of model performance. To address these limitations, we propose a novel benchmark, PythonSaga, featuring 185 hand-crafted prompts that give a balanced representation of 38 programming concepts across diverse difficulty levels. The robustness of our benchmark is demonstrated by the poor performance of existing Code-LLMs.