CL · Mar 27, 2024

CheckEval: A reliable LLM-as-a-Judge framework for evaluating text generation using checklists

arXiv:2403.18771v3 · 53 citations · h-index: 20 · EMNLP
Originality: Incremental advance
AI Analysis

This work addresses reliability issues in automated evaluation of text generation, offering a more consistent and interpretable method. It is an incremental advance because it builds on existing LLM-as-a-Judge approaches.

The paper tackles rating inconsistencies in LLM-as-a-Judge evaluation of text generation by introducing CheckEval, a checklist-based framework that improves average agreement across evaluator models by 0.45 and reduces score variance.

Existing LLM-as-a-Judge approaches for evaluating text generation suffer from rating inconsistencies, with low agreement and high rating variance across different evaluator models. We attribute this to subjective evaluation criteria combined with Likert scale scoring in existing protocols. To address this issue, we introduce CheckEval, a checklist-based evaluation framework that improves rating reliability via decomposed binary questions. Through experiments with 12 evaluator models across multiple datasets, we first demonstrate that CheckEval strongly correlates with human judgments. More importantly, CheckEval dramatically improves the average agreement across evaluator models by 0.45 and reduces the score variance. CheckEval scores furthermore have the benefit of being more interpretable because it decomposes evaluation criteria into traceable binary decisions, allowing analyses of specific attributes driving quality judgments.
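
The idea of decomposing a criterion into binary questions can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the checklist questions, the evaluator names, the yes/no answers, and the aggregation rule (fraction of "yes" answers) are hypothetical and are not taken from the paper, which may aggregate answers and measure agreement differently.

```python
from statistics import pvariance

# Hypothetical checklist for one criterion ("coherence").
# These questions are illustrative, not the paper's checklist.
CHECKLIST = [
    "Does the summary follow a logical order?",
    "Is every sentence connected to the one before it?",
    "Is the summary free of contradictory statements?",
    "Does the summary stay on the topic of the source?",
]


def checklist_score(answers):
    """Return the fraction of checklist questions answered 'yes' (True).

    Assumed aggregation rule for this sketch only.
    """
    if not answers:
        raise ValueError("answers must not be empty")
    return sum(bool(a) for a in answers) / len(answers)


# Hypothetical yes/no answers from three evaluator models for one output,
# in the same order as CHECKLIST.
answers_by_evaluator = {
    "evaluator_a": [True, True, False, True],
    "evaluator_b": [True, True, False, True],
    "evaluator_c": [True, False, False, True],
}

# One score per evaluator model.
scores = {name: checklist_score(a) for name, a in answers_by_evaluator.items()}

# Population variance of the scores across evaluator models.
score_variance = pvariance(list(scores.values()))

# Questions on which the evaluators do not all give the same answer.
# Binary decisions make the source of disagreement traceable.
disputed = [
    question
    for i, question in enumerate(CHECKLIST)
    if len({answers[i] for answers in answers_by_evaluator.values()}) > 1
]

print(scores)
print(score_variance)
print(disputed)
```

In this toy setup the score differences between evaluators trace back to a single question, which is the kind of attribute-level analysis the abstract describes. A Likert rating would expose only the final numbers.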

