CL | May 24, 2023

Not All Metrics Are Guilty: Improving NLG Evaluation by Diversifying References

arXiv:2305.15067v3 | 36 citations | Has Code
Originality: Incremental advance
AI Analysis

This work addresses a key bottleneck in NLG evaluation for researchers and practitioners, though the advance is incremental because it builds on existing benchmarks.

The paper tackles the poor correlation between automatic and human evaluation in natural language generation (NLG) that arises when each sample has only a few references. It proposes Div-Ref, a method that uses large language models to diversify references, and reports significantly improved correlation with human judgements.

Most research on natural language generation (NLG) relies on evaluation benchmarks with only a limited number of references per sample, which may result in poor correlation with human judgements. The underlying reason is that one semantic meaning can be expressed in many different forms, so evaluation against a single reference, or only a few, may not accurately reflect the quality of the model's hypotheses. To address this issue, this paper presents a simple and effective method, named Div-Ref, that enhances existing evaluation benchmarks by increasing the number of references. We leverage large language models (LLMs) to diversify the expression of a single reference into multiple high-quality ones that cover the semantic space of the reference sentence as much as possible. We conduct comprehensive experiments to demonstrate empirically that diversifying the expression of references can significantly enhance the correlation between automatic evaluation and human evaluation. The idea is also compatible with recent LLM-based evaluation, which can similarly benefit from incorporating multiple references. We strongly encourage future generation benchmarks to include more references, even if they are generated by LLMs, since this is a one-time effort. We release all the code and data at https://github.com/RUCAIBox/Div-Ref to facilitate research.
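
The effect described in the abstract can be illustrated with a minimal sketch. It is not taken from the Div-Ref repository: the unigram F1 metric is a simple stand-in for metrics such as BLEU or ROUGE, the paraphrases are hand-written placeholders for LLM-generated references, and taking the maximum score over references is an assumed aggregation rule. The function names `unigram_f1` and `multi_reference_score` are hypothetical.

```python
from collections import Counter


def unigram_f1(hypothesis: str, reference: str) -> float:
    """Unigram-overlap F1 between two strings (stand-in for a real NLG metric)."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    if not hyp or not ref:
        return 0.0
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def multi_reference_score(hypothesis, references, metric=unigram_f1):
    """Score a hypothesis against several references.

    Assumption: the best-matching reference determines the score (max aggregation).
    """
    if not references:
        raise ValueError("at least one reference is required")
    return max(metric(hypothesis, ref) for ref in references)


# Hypothetical data: one original reference plus hand-written paraphrases
# standing in for references that an LLM would generate.
original = "the cat sat on the mat"
paraphrases = [
    "a cat was sitting on the mat",
    "the cat is sitting on the rug",
]
hypothesis = "a cat was sitting on the rug"

single = multi_reference_score(hypothesis, [original])
diversified = multi_reference_score(hypothesis, [original] + paraphrases)
print(f"single reference:       {single:.3f}")
print(f"diversified references: {diversified:.3f}")
```

Under max aggregation, adding references can only keep the score the same or raise it, so a valid hypothesis that is worded differently from the original reference is no longer penalised for surface mismatch alone. Whether this improves correlation with human judgement is the empirical question the paper investigates.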

Code Implementations: 2 repos
