SE · AI · CL · Sep 29, 2024

CRScore: Grounding Automated Evaluation of Code Review Comments in Code Claims and Smells

CMU
arXiv:2409.19801v2 · 16 citations · h-index: 9 · Has Code
Originality: Incremental advance
AI Analysis

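This work addresses the need for better evaluation metrics in automated code review, a one-to-many problem in which many different reviews can be valid for the same diff. It gives researchers and practitioners a metric that is more sensitive and better aligned with human judgment than reference-based alternatives. The advance is incremental, since it builds on existing methods.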

The authors tackle automated evaluation of code review comments with CRScore, a reference-free metric that measures review-quality dimensions such as conciseness, comprehensiveness, and relevance. CRScore achieves a 0.54 Spearman correlation with human judgment, and the authors release a corpus of 2.9k human-annotated review quality scores.

The task of automated code review has recently gained a lot of attention from the machine learning community. However, current review comment evaluation metrics rely on comparisons with a human-written reference for a given code change (also called a diff). Furthermore, code review is a one-to-many problem, like generation and summarization, with many "valid reviews" for a diff. Thus, we develop CRScore, a reference-free metric to measure dimensions of review quality like conciseness, comprehensiveness, and relevance. We design CRScore to evaluate reviews in a way that is grounded in claims and potential issues detected in the code by LLMs and static analyzers. We demonstrate that CRScore can produce valid, fine-grained scores of review quality that have the greatest alignment with human judgment among open source metrics (0.54 Spearman correlation) and are more sensitive than reference-based metrics. We also release a corpus of 2.9k human-annotated review quality scores for machine-generated and GitHub review comments to support the development of automated metrics.
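
The alignment figure quoted above is a Spearman rank correlation between metric scores and human ratings. As a rough illustration of what that number measures, here is a minimal sketch in plain Python. It is not the authors' evaluation code: the function names and the sample data are hypothetical, and the paper's actual aggregation over reviews and annotators may differ.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: one automated score and one human rating per review.
metric_scores = [0.10, 0.35, 0.40, 0.80, 0.65]
human_ratings = [1, 2, 4, 5, 3]

rho = spearman(metric_scores, human_ratings)
print(round(rho, 2))  # 0.9
```

Because only the ordering of the scores matters, a metric can reach a high Spearman correlation without matching the scale of the human ratings, which suits comparisons between metrics that report scores in different ranges.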

