CL | Oct 19, 2024

mHumanEval -- A Multilingual Benchmark to Evaluate Large Language Models for Code Generation

arXiv:2410.15037v2 | 51 citations | h-index: 9 | Volume 1
Originality: Incremental advance
AI Analysis

This paper addresses the problem of evaluating code generation models on prompts written in low-resource languages, which matters to developers and researchers working on multilingual AI systems. The contribution is incremental, since it extends an existing benchmark.

The authors address the limitation that existing code generation benchmarks focus primarily on English-to-Python tasks. They introduce mHumanEval, a multilingual benchmark supporting prompts in over 200 natural languages, and find that state-of-the-art Code LLMs show varying performance across languages.

Recent advancements in large language models (LLMs) have significantly enhanced code generation from natural language prompts. The HumanEval Benchmark, developed by OpenAI, remains the most widely used code generation benchmark. However, this and other Code LLM benchmarks face critical limitations, particularly in task diversity, test coverage, and linguistic scope. Current evaluations primarily focus on English-to-Python conversion tasks with limited test cases, potentially overestimating model performance. While recent works have addressed test coverage and programming language (PL) diversity, code generation from low-resource language prompts remains largely unexplored. To address this gap, we introduce mHumanEval, an extended benchmark supporting prompts in over 200 natural languages. We employ established machine translation methods to compile the benchmark, coupled with a quality assurance process. Furthermore, we provide expert human translations for 15 diverse natural languages (NLs). We conclude by analyzing the multilingual code generation capabilities of state-of-the-art (SOTA) Code LLMs, offering insights into the current landscape of cross-lingual code generation.
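As a rough illustration of what a cross-lingual comparison could look like, the sketch below aggregates per-task outcomes into a pass rate for each prompt language. It is a minimal sketch under stated assumptions, not the authors' evaluation harness: the function name `pass_rate_by_language`, the record layout, and the language codes are hypothetical, and the metric is a simple fraction of tasks whose generated solution passed its tests (in the spirit of HumanEval's pass@1), which the abstract does not specify.

```python
from collections import defaultdict


def pass_rate_by_language(results):
    """Return the fraction of passed tasks for each prompt language.

    `results` is an iterable of (language_code, passed) pairs, where
    `passed` is True when the generated solution passed all test cases
    for that task. This record layout is a hypothetical choice made
    for the example.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for language, passed in results:
        totals[language] += 1
        if passed:
            passes[language] += 1
    return {language: passes[language] / totals[language] for language in totals}


# Made-up outcomes for illustration only; these are not results from the paper.
sample = [
    ("eng_Latn", True),
    ("eng_Latn", True),
    ("zul_Latn", True),
    ("zul_Latn", False),
]
rates = pass_rate_by_language(sample)
```

Grouping by prompt language keeps the programming task fixed and varies only the natural language of the prompt, which is the comparison the benchmark is designed to support.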

Code Implementations: 1 repo
