CL · May 19, 2024

MHPP: Exploring the Capabilities and Limitations of Language Models Beyond Basic Code Generation

arXiv:2405.11430v3 · 29 citations · h-index: 26 · Has Code
Originality: Incremental advance
AI Analysis

The paper addresses a gap in how AI researchers and developers benchmark code generation, though the advance is incremental because it builds on existing evaluation methods.

The study addresses the inadequacy of existing benchmarks for evaluating function-level code generation in large language models by introducing the Mostly Hard Python Problems (MHPP) dataset. Evaluations on MHPP showed that many models that perform well on HumanEval fail to achieve similar success, exposing previously undiscovered limitations.

Recent advancements in large language models (LLMs) have greatly improved code generation, specifically at the function level. For instance, GPT-4o has achieved a 91.0% pass rate on HumanEval. However, this calls into question the adequacy of existing benchmarks in thoroughly assessing function-level code generation capabilities. Our study analyzed two common benchmarks, HumanEval and MBPP, and found that these might not thoroughly evaluate LLMs' code generation capacities due to limitations in quality, difficulty, and granularity. To resolve this, we introduce the Mostly Hard Python Problems (MHPP) dataset, consisting of 210 unique human-curated problems. By focusing on the combination of natural language and code reasoning, MHPP gauges LLMs' abilities to comprehend specifications and restrictions, engage in multi-step reasoning, and apply coding knowledge effectively. Initial evaluations of 26 LLMs using MHPP showed that many high-performing models on HumanEval failed to achieve similar success on MHPP. Moreover, MHPP highlighted various previously undiscovered limitations within various LLMs, leading us to believe that it could pave the way for a better understanding of LLMs' capabilities and limitations. MHPP, the evaluation pipeline, and the leaderboard can be found at https://github.com/SparksofAGI/MHPP.
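
For readers unfamiliar with the "pass rate" figure quoted above, the following is a minimal sketch of how such a score is typically computed for function-level benchmarks: a generated function counts as solved only if it passes every test case for its problem, and the pass rate is the percentage of problems solved. This is not the MHPP evaluation pipeline; the names `passes_all` and `pass_rate` and the toy problem are hypothetical, and the sketch assumes one generated solution per problem.

```python
from typing import Callable, Iterable


def passes_all(candidate: Callable, tests: Iterable[tuple[tuple, object]]) -> bool:
    """Return True only if the candidate returns the expected value on every test.

    Each test is an (args, expected) pair. Any exception counts as a failure.
    """
    for args, expected in tests:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            return False
    return True


def pass_rate(results: list[bool]) -> float:
    """Percentage of problems whose candidate solution passed all of its tests."""
    if not results:
        raise ValueError("no results to score")
    return 100.0 * sum(results) / len(results)


# Toy illustration (hypothetical problem, not taken from MHPP or HumanEval).
tests_for_add = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
correct = lambda a, b: a + b   # passes every test
buggy = lambda a, b: a - b     # fails the first test

results = [passes_all(correct, tests_for_add), passes_all(buggy, tests_for_add)]
print(pass_rate(results))
```

A real harness would also sandbox the generated code and enforce timeouts, which this sketch omits.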

Code Implementations: 1 repo
