SECL · Aug 20, 2024

CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?

arXiv:2408.10718v2 · 36 citations · h-index: 19 · Has Code
Originality: Incremental advance
AI Analysis

This paper addresses AI researchers' need for better benchmarks of code understanding. The advance is incremental because it builds on existing benchmarking efforts.

The authors tackled the problem of evaluating large language models' code understanding by introducing CodeJudge-Eval, a benchmark that assesses models through code judging rather than generation, and found that even state-of-the-art models struggle with it.

Recent advancements in large language models (LLMs) have showcased impressive code generation capabilities, primarily evaluated through language-to-code benchmarks. However, these benchmarks may not fully capture a model's code understanding abilities. We introduce CodeJudge-Eval (CJ-Eval), a novel benchmark designed to assess LLMs' code understanding abilities from the perspective of code judging rather than code generation. CJ-Eval challenges models to determine the correctness of provided code solutions, encompassing various error types and compilation issues. By leveraging a diverse set of problems and a fine-grained judging system, CJ-Eval addresses the limitations of traditional benchmarks, including the potential memorization of solutions. Evaluation of 12 well-known LLMs on CJ-Eval reveals that even state-of-the-art models struggle, highlighting the benchmark's ability to probe deeper into models' code understanding abilities. Our codes and benchmark are available at https://github.com/CodeLLM-Research/CodeJudge-Eval.
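
To make the code-judging setup concrete, here is a minimal sketch of how a model's verdicts might be scored against reference verdicts. It is illustrative only, not the benchmark's actual harness. The verdict labels and the function name score_judgments are assumptions for this example, and CJ-Eval's real label taxonomy and metrics may differ.

```python
from collections import Counter

# Hypothetical verdict labels; the benchmark's actual taxonomy may differ.
VERDICTS = ("accepted", "compile_error", "runtime_error",
            "wrong_answer", "time_limit_exceeded")


def score_judgments(gold, predicted):
    """Return (accuracy, macro_f1) for predicted verdicts against gold verdicts.

    Macro F1 is averaged over the labels that occur in the gold verdicts.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("at least one judgment is required")

    # Per-label true positives, false positives, and false negatives.
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1

    accuracy = sum(tp.values()) / len(gold)

    # F1 = 2*TP / (2*TP + FP + FN); every gold label has TP + FN >= 1.
    f1_scores = [
        2 * tp[label] / (2 * tp[label] + fp[label] + fn[label])
        for label in sorted(set(gold))
    ]
    macro_f1 = sum(f1_scores) / len(f1_scores)
    return accuracy, macro_f1


if __name__ == "__main__":
    gold = ["accepted", "wrong_answer", "compile_error", "accepted"]
    predicted = ["accepted", "accepted", "compile_error", "accepted"]
    accuracy, macro_f1 = score_judgments(gold, predicted)
    print(f"accuracy={accuracy:.2f} macro_f1={macro_f1:.2f}")
```

Reporting macro F1 alongside accuracy is a design choice for this sketch: a model that labels every solution "accepted" can still reach a fair accuracy when most solutions are correct, while macro F1 penalizes the missed error classes.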

Code Implementations: 1 repo