CL | Feb 19, 2024

Are LLM-based Evaluators Confusing NLG Quality Criteria?

arXiv:2402.12055v2 | 46 citations | h-index: 9 | ACL
AI Analysis

This paper addresses a reliability problem for researchers and practitioners who use LLM-based evaluators in NLG tasks. The contribution is incremental, since it builds on prior work showing that LLMs perform well in evaluation.

The paper tackles the problem of LLMs confusing different natural language generation (NLG) quality criteria during evaluation, which reduces their reliability. The results reveal inherent confusion issues and other noteworthy phenomena in LLMs, necessitating further research and improvements.

Some prior work has shown that LLMs perform well in NLG evaluation for different tasks. However, we find that LLMs seem to confuse different evaluation criteria, which reduces their reliability. To verify this, we first address the inconsistent conceptualization and vague expression in existing NLG quality criteria themselves: we summarize a clear hierarchical classification system for 11 common aspects, with the corresponding criteria drawn from previous studies. Inspired by behavioral testing, we design 18 types of aspect-targeted perturbation attacks for fine-grained analysis of the evaluation behaviors of different LLMs. We also conduct human annotations beyond the guidance of the classification system to validate the impact of the perturbations. Our experimental results reveal confusion issues inherent in LLMs, as well as other noteworthy phenomena, and call for further research and improvements in LLM-based evaluation.
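
The behavioral-testing idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the perturbation (`duplicate_words`), the stand-in scorer (`toy_scorer`), the criterion names, and the threshold are all hypothetical, chosen only to show the shape of the check. A perturbation targets one aspect, the evaluator scores the original and perturbed texts under several criteria, and a score drop on a criterion the perturbation did not target is treated as a sign of confusion.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for an LLM-based evaluator: (text, criterion) -> score.
Scorer = Callable[[str, str], float]


def duplicate_words(text: str, every: int = 3) -> str:
    """Toy fluency-targeted perturbation: repeat every `every`-th word."""
    out: List[str] = []
    for i, word in enumerate(text.split(), start=1):
        out.append(word)
        if i % every == 0:
            out.append(word)
    return " ".join(out)


def score_drops(scorer: Scorer, original: str, perturbed: str,
                criteria: List[str]) -> Dict[str, float]:
    """Score drop (original minus perturbed) for each criterion."""
    return {c: scorer(original, c) - scorer(perturbed, c) for c in criteria}


def confused_criteria(drops: Dict[str, float], targeted: str,
                      threshold: float = 0.0) -> List[str]:
    """Criteria other than the targeted one whose score dropped by more than `threshold`."""
    return [c for c, d in drops.items() if c != targeted and d > threshold]


def toy_scorer(text: str, criterion: str) -> float:
    """Deliberately flawed scorer: penalizes adjacent repeated words
    under every criterion, regardless of which one is asked about."""
    words = text.split()
    repeats = sum(1 for a, b in zip(words, words[1:]) if a == b)
    return 5.0 - repeats


if __name__ == "__main__":
    summary = "the cat sat on the mat near the door"
    perturbed = duplicate_words(summary)
    drops = score_drops(toy_scorer, summary, perturbed, ["fluency", "relevance"])
    print(drops)
    print(confused_criteria(drops, targeted="fluency"))
```

In a real setup, `toy_scorer` would be replaced by a call to the LLM evaluator under test, and the threshold would need to account for scoring noise. The toy scorer is intentionally confused so that the check has something to flag.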

Code Implementations: 2 repos
