CL · Aug 8, 2023

Learning Evaluation Models from Large Language Models for Sequence Generation

arXiv:2308.04386v3 · 5 citations · h-index: 32
Originality: Incremental advance
AI Analysis

This work addresses the scarcity of labeled data for training model-based evaluation metrics in sequence generation, and offers an approach that can be customized to different real-world evaluation scenarios.

The paper tackles automatic evaluation for sequence generation by proposing CSEM, a method that uses large language models to generate labeled data for training evaluation models, so that no human-labeled data is needed. In reinforcement learning and reranking experiments, CSEM-trained metrics outperform traditional metrics and improve the quality of the generated sequences.

Automatic evaluation of sequence generation, traditionally reliant on metrics like BLEU and ROUGE, often fails to capture the semantic accuracy of generated text sequences because these metrics emphasize n-gram overlap. A promising solution to this problem is to develop model-based metrics, such as BLEURT and COMET. However, these approaches are typically hindered by the scarcity of labeled evaluation data, which is necessary to train the evaluation models. In this work, we address this challenge by proposing the Customized Sequence Evaluation Metric (CSEM), a three-stage evaluation model training method that utilizes large language models to generate labeled data for model-based metric development, thereby eliminating the need for human-labeled data. Additionally, we expand the scope of CSEM to support various evaluation types, including single-aspect, multi-aspect, reference-free, and reference-based evaluations, enabling the customization of metrics to suit diverse real-world scenarios. Experimental results on the SummEval benchmark demonstrate that CSEM can effectively train an evaluation model without human-labeled data. Further experiments in reinforcement learning and reranking show that metrics developed through CSEM outperform traditional evaluation metrics, leading to substantial improvements in sequence quality as evaluated by both commonly used metrics and ChatGPT.
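
To make the reranking setting mentioned in the abstract concrete, here is a minimal sketch, assuming a pluggable scoring function. It is not the paper's implementation: `unigram_f1` and `rerank` are hypothetical names introduced for illustration, and the toy token-overlap scorer only stands in for whatever metric is plugged in (a learned evaluation model would take its place in practice). The example also hints at why overlap-based scoring is limited: it rewards shared surface tokens rather than shared meaning.

```python
from collections import Counter
from typing import Callable, List, Sequence


def unigram_f1(candidate: str, reference: str) -> float:
    """Toy overlap score (assumption, not from the paper):
    F1 over lowercased, whitespace-separated tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped token matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def rerank(candidates: Sequence[str], score: Callable[[str], float]) -> List[str]:
    """Return candidates sorted from highest to lowest score.
    `score` is any metric mapping a candidate to a float."""
    return sorted(candidates, key=score, reverse=True)


reference = "the cat sat on the mat"
candidates = [
    "a dog ran in the park",
    "the cat sat on the mat",
    "a cat was sitting on the mat",
]

# Reference-based reranking with the toy scorer; a learned metric
# could be substituted by passing a different `score` callable.
ranked = rerank(candidates, lambda c: unigram_f1(c, reference))
for text in ranked:
    print(f"{unigram_f1(text, reference):.3f}  {text}")
```

Because `rerank` only depends on a callable that returns a float, the same function covers reference-free scoring as well: the callable simply ignores any reference.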

Code Implementations: 1 repo