SE, CL · Aug 20, 2024

CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?

Yuwei Zhao, Ziyang Luo, Yuchen Tian, Hongzhan Lin, Weixiang Yan, Annan Li, Jing Ma

arXiv:2408.10718v224.736 citations · h-index: 19 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses AI researchers' need for better benchmarks of code understanding, though it is an incremental advance that builds on existing benchmarking efforts.

The authors tackled the problem of evaluating large language models' code understanding by introducing CodeJudge-Eval, a benchmark that assesses models through code judging rather than generation, and found that even state-of-the-art models struggle with it.

Recent advancements in large language models (LLMs) have showcased impressive code generation capabilities, primarily evaluated through language-to-code benchmarks. However, these benchmarks may not fully capture a model's code understanding abilities. We introduce CodeJudge-Eval (CJ-Eval), a novel benchmark designed to assess LLMs' code understanding abilities from the perspective of code judging rather than code generation. CJ-Eval challenges models to determine the correctness of provided code solutions, encompassing various error types and compilation issues. By leveraging a diverse set of problems and a fine-grained judging system, CJ-Eval addresses the limitations of traditional benchmarks, including the potential memorization of solutions. Evaluation of 12 well-known LLMs on CJ-Eval reveals that even state-of-the-art models struggle, highlighting the benchmark's ability to probe deeper into models' code understanding abilities. Our code and benchmark are available at https://github.com/CodeLLM-Research/CodeJudge-Eval.
