SE · AI · CL · May 27, 2025

An LLM-as-Judge Metric for Bridging the Gap with Human Evaluation in SE Tasks

arXiv:2505.20854v2 · 9 citations · h-index: 14
Originality: Incremental advance
AI Analysis

SE-Jury offers a scalable and reliable alternative to human evaluation for software engineering tasks such as code generation, program repair, and code summarization. The advance is incremental, since it builds on existing LLM-as-judge methods.

The paper tackles the challenge of accurately assessing the correctness of LLM-generated software artifacts by introducing SE-Jury, an LLM-as-Ensemble-Judge metric that improves correlation with human judgments by 29.6% to 140.8% over existing automatic metrics.

Large Language Models (LLMs) and other automated techniques have been increasingly used to support software developers by generating software artifacts such as code snippets, patches, and comments. However, accurately assessing the correctness of these generated artifacts remains a significant challenge. On one hand, human evaluation provides high accuracy but is labor-intensive and lacks scalability. On the other hand, many automatic evaluation metrics are scalable and require minimal human effort, but they often fail to accurately reflect the actual correctness of generated software artifacts. In this paper, we present SE-Jury, the first evaluation metric for LLM-as-Ensemble-Judge specifically designed to accurately assess the correctness of generated software artifacts. SE-Jury first defines five distinct evaluation strategies, each implemented by an independent judge. A dynamic team selection mechanism then identifies the most appropriate subset of judges as a team to produce a final correctness score through ensembling. We evaluate SE-Jury across a diverse set of software engineering (SE) benchmarks that span three popular SE tasks: code generation, automated program repair, and code summarization. Results demonstrate that SE-Jury consistently achieves a higher correlation with human judgments, with improvements ranging from 29.6% to 140.8% over existing automatic metrics. SE-Jury reaches agreement levels with human annotators that are close to inter-annotator agreement in code generation and program repair. These findings underscore SE-Jury's potential as a scalable and reliable alternative to human evaluation in these SE tasks.
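The abstract does not spell out how the dynamic team selection or the ensembling works, so the following is only a minimal sketch under stated assumptions: each judge has already produced a numeric correctness score per sample, the team is chosen by exhaustive search over judge subsets on a small human-annotated set, ensembling is a plain average, and agreement is measured with Pearson correlation. The judge names, the scores, and the helper functions `pearson`, `ensemble_scores`, and `select_team` are hypothetical and are not taken from the paper.

```python
from itertools import combinations
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def ensemble_scores(judge_scores, team):
    """Average the per-sample scores of the judges in `team` (assumed ensembling rule)."""
    n_samples = len(next(iter(judge_scores.values())))
    return [mean(judge_scores[j][i] for j in team) for i in range(n_samples)]


def select_team(judge_scores, human_scores):
    """Try every non-empty judge subset and keep the one whose averaged
    score correlates best with the human scores (assumed selection rule)."""
    judges = sorted(judge_scores)
    best_team, best_corr = None, float("-inf")
    for size in range(1, len(judges) + 1):
        for team in combinations(judges, size):
            corr = pearson(ensemble_scores(judge_scores, team), human_scores)
            if corr > best_corr:
                best_team, best_corr = team, corr
    return best_team, best_corr


if __name__ == "__main__":
    # Hypothetical scores on a 0-4 scale for five annotated samples.
    human = [0, 1, 2, 3, 4]
    judges = {
        "judge_a": [0, 1, 2, 3, 4],
        "judge_b": [4, 3, 2, 1, 0],
        "judge_c": [1, 1, 1, 1, 1],
    }
    team, corr = select_team(judges, human)
    print(team, corr)
```

With five judges the exhaustive search covers only 31 subsets, so brute force is cheap here. A real system would likely select the team on a held-out annotated sample and then apply it to unlabeled outputs.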

