SEAI · Jul 30, 2024

TaskEval: Assessing Difficulty of Code Generation Tasks for Large Language Models

arXiv:2407.21227v3 · 3 citations · h-index: 59
Originality: Incremental advance
AI Analysis

This work addresses the need of researchers and practitioners for better evaluation methods in code generation. The advance is incremental, as it builds on existing benchmarking efforts.

The paper tackles the problem of assessing task difficulty in code generation benchmarks for Large Language Models (LLMs). It introduces TaskEval, a framework that uses diverse prompts and Item Response Theory to characterize tasks. Experiments on HumanEval+ and ClassEval with 8 LLMs show that it can identify topics and patterns related to task difficulty.

Large Language Models (LLMs) excel in code-related tasks like code generation, but benchmark evaluations often overlook task characteristics, such as difficulty. Moreover, benchmarks are usually built using tasks described with a single prompt, despite the formulation of prompts having a profound impact on the outcome. This paper introduces a generalist approach, TaskEval, a framework using diverse prompts and Item Response Theory (IRT) to efficiently assess LLMs' capabilities and benchmark task characteristics, improving the understanding of their performance. Using two code generation benchmarks, HumanEval+ and ClassEval, as well as 8 code generation LLMs, we show that TaskEval is capable of characterising the properties of tasks. Using topic analysis, we identify and analyse the tasks of 17 and 21 topics within the benchmarks. We also cross-analyse tasks' characteristics with programming constructs (e.g., variable assignment, conditions, etc.) used by LLMs, emphasising some patterns with tasks' difficulty. Finally, we conduct a comparison between the difficulty assessment of tasks by human annotators and LLMs. Orthogonal to current benchmarking evaluation efforts, TaskEval can assist researchers and practitioners in fostering better assessments of LLMs. The tasks' characteristics can be used to identify shortcomings within existing benchmarks or improve the evaluation of LLMs.
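To make the Item Response Theory idea concrete, here is a minimal sketch of how a task's difficulty might be estimated from pass/fail outcomes. It assumes a one-parameter (Rasch) logistic model with known model abilities. The abstract does not say which IRT variant or fitting procedure TaskEval uses, so the function names, the ability values, and the gradient-ascent fit are all illustrative assumptions rather than the authors' method.

```python
import math


def p_correct(ability: float, difficulty: float) -> float:
    """Rasch item response function: probability that a model with the
    given ability solves a task with the given difficulty."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def estimate_difficulty(outcomes, abilities, steps=200, lr=0.5):
    """Maximum-likelihood difficulty of a single task, holding abilities fixed.

    outcomes[i] is 1 if model i solved the task and 0 otherwise.
    The log-likelihood gradient with respect to difficulty is sum(p_i - y_i),
    so plain gradient ascent is enough for this sketch.
    Note: if every outcome is identical, the estimate has no finite optimum.
    """
    difficulty = 0.0
    for _ in range(steps):
        grad = sum(p_correct(a, difficulty) - y
                   for a, y in zip(abilities, outcomes))
        difficulty += lr * grad / len(outcomes)
    return difficulty


# Hypothetical abilities for four models (assumed values, not from the paper).
abilities = [-1.0, 0.0, 1.0, 2.0]

# Hypothetical outcomes: most models solve the first task, few solve the second.
easy_task = [0, 1, 1, 1]
hard_task = [0, 0, 0, 1]

easy_b = estimate_difficulty(easy_task, abilities)
hard_b = estimate_difficulty(hard_task, abilities)
print(f"easy task difficulty: {easy_b:.2f}")
print(f"hard task difficulty: {hard_b:.2f}")
```

In this toy setup, the task that fewer models solve receives the higher difficulty estimate. A fuller treatment would estimate abilities and difficulties jointly, and would presumably aggregate outcomes over several prompts per task, as the abstract describes.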

