CL · Feb 23, 2025

CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models

arXiv:2502.16614v1 · 11 citations · h-index: 16 · Has Code
Originality: Synthesis-oriented
AI Analysis

This paper addresses the need for better evaluation of LLMs' critique abilities in the code domain. The contribution is incremental: it builds on existing critique benchmarks and extends them to more tasks and evaluation dimensions.

The authors tackle the problem of evaluating the critique capacity of Large Language Models (LLMs) on code tasks by introducing CodeCriticBench, a holistic benchmark that covers code generation and code QA at varying difficulty levels and provides comprehensive evaluation protocols. They demonstrate its effectiveness through extensive experiments on existing LLMs.

The critique capacity of Large Language Models (LLMs) is essential to their reasoning abilities, as it allows them to provide useful suggestions (e.g., detailed analysis and constructive feedback). How to evaluate the critique capacity of LLMs has therefore drawn great attention, and several critique benchmarks have been proposed. However, existing critique benchmarks usually have the following limitations: (1) They focus on diverse reasoning tasks in general domains and evaluate code tasks insufficiently (e.g., covering only the code generation task), and their queries are relatively easy (e.g., the code queries of CriticBench come from HumanEval and MBPP). (2) They lack comprehensive evaluation across different dimensions. To address these limitations, we introduce a holistic code critique benchmark for LLMs called CodeCriticBench. Specifically, CodeCriticBench includes two mainstream code tasks (i.e., code generation and code QA) with different difficulties. In addition, the evaluation protocols include basic critique evaluation and advanced critique evaluation for different characteristics, with fine-grained evaluation checklists designed for the advanced settings. Finally, we conduct extensive experiments on existing LLMs, and the results show the effectiveness of CodeCriticBench.

Code Implementations: 1 repo
