CL · May 19, 2025

EffiBench-X: A Multi-Language Benchmark for Measuring Efficiency of LLM-Generated Code

MIT
arXiv:2505.13004v1 · 24 citations · h-index: 34 · Has Code
Originality: Incremental advance
AI Analysis

This paper addresses inefficient code generation by LLMs, a problem relevant to developers and researchers. The contribution is incremental: it extends existing benchmarks with efficiency measurement and multi-language support.

The paper tackles the lack of benchmarks for evaluating code efficiency in LLM-generated code across multiple languages, introducing EffiBench-X and finding that LLMs achieve only about 62% of human efficiency on average, with significant variations by language.

Existing code generation benchmarks primarily evaluate functional correctness, with limited focus on code efficiency and often restricted to a single language like Python. To address this gap, we introduce EffiBench-X, the first multi-language benchmark designed to measure the efficiency of LLM-generated code. EffiBench-X supports Python, C++, Java, JavaScript, Ruby, and Golang. It comprises competitive programming tasks with human-expert solutions as efficiency baselines. Evaluating state-of-the-art LLMs on EffiBench-X reveals that while models generate functionally correct code, they consistently underperform human experts in efficiency. Even the most efficient LLM-generated solutions (Qwen3-32B) achieve only around 62% of human efficiency on average, with significant language-specific variations. LLMs show better efficiency in Python, Ruby, and JavaScript than in Java, C++, and Golang. For instance, DeepSeek-R1's Python code is significantly more efficient than its Java code. These results highlight the critical need for research into LLM optimization techniques to improve code efficiency across diverse languages. The dataset and evaluation infrastructure are available at https://github.com/EffiBench/EffiBench-X.git and https://huggingface.co/datasets/EffiBench/effibench-x.
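To make a figure like "62% of human efficiency" concrete, here is a minimal sketch of how a score relative to a human baseline could be computed. It is an illustration under stated assumptions, not the metric defined by EffiBench-X: the ratio definition (human runtime divided by LLM runtime), the clipping at 1.0, the function names, and the timing values are all hypothetical.

```python
from statistics import mean


def relative_efficiency(human_seconds: float, llm_seconds: float) -> float:
    """Ratio of human baseline runtime to LLM solution runtime, clipped to 1.0.

    Assumed definition for illustration only; the benchmark's own metric
    may differ (for example, it may also account for memory usage).
    """
    if human_seconds <= 0 or llm_seconds <= 0:
        raise ValueError("runtimes must be positive")
    return min(1.0, human_seconds / llm_seconds)


def mean_relative_efficiency(pairs):
    """Average the per-task ratios over (human_seconds, llm_seconds) pairs."""
    return mean(relative_efficiency(h, l) for h, l in pairs)


# Hypothetical per-task timings: (human baseline, LLM solution), in seconds.
timings = [(1.0, 2.0), (0.5, 0.5), (2.0, 8.0)]
score = mean_relative_efficiency(timings)
print(f"{score:.1%}")  # prints "58.3%"
```

Under this assumed definition, a score of 1.0 means the generated solution matches or beats the human baseline on a task, and values below 1.0 indicate proportionally slower code.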
