CL · Oct 30, 2023

The Eval4NLP 2023 Shared Task on Prompting Large Language Models as Explainable Metrics

Christoph Leiter, Juri Opitz, Daniel Deutsch, Yang Gao, Rotem Dror, Steffen Eger

arXiv:2310.19792v1 · 20.7115 citations · h-index: 16 · Has Code

Originality: Synthesis-oriented

AI Analysis

This paper addresses the problem of developing efficient evaluation metrics for text generation tasks, but its contribution is incremental: it builds on existing prompting methods within a restricted competition setting.

The paper introduces the Eval4NLP 2023 shared task, which explores prompting and score extraction with large language models for evaluating machine translation and summarization. The best systems achieve results on par with or surpassing recent reference-free metrics such as GEMBA and Comet-Kiwi-XXL.

With an increasing number of parameters and pre-training data, generative large language models (LLMs) have shown remarkable capabilities to solve tasks with minimal or no task-related examples. Notably, LLMs have been successfully employed as evaluation metrics in text generation tasks. Within this context, we introduce the Eval4NLP 2023 shared task that asks participants to explore prompting and score extraction for machine translation (MT) and summarization evaluation. Specifically, we propose a novel competition setting in which we select a list of allowed LLMs and disallow fine-tuning to ensure a focus on prompting. We present an overview of participants' approaches and evaluate them on a new reference-free test set spanning three language pairs for MT and a summarization dataset. Notably, despite the task's restrictions, the best-performing systems achieve results on par with or even surpassing recent reference-free metrics developed using larger models, including GEMBA and Comet-Kiwi-XXL. Finally, as a separate track, we perform a small-scale human evaluation of the plausibility of explanations given by the LLMs.
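
The two steps the task asks participants to explore, prompting and score extraction, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the prompt wording, the 0 to 100 scale, and the names `PROMPT_TEMPLATE`, `build_prompt`, and `extract_score` are hypothetical and are not taken from the shared task or from any participant's system. The sketch only shows the reference-free setup (source and hypothesis, no reference) and one simple way to turn free-form model output into a number.

```python
import re
from typing import Optional

# Hypothetical prompt template (an assumption, not the shared task's prompt).
# It is reference-free: only the source and the hypothesis are shown to the model.
PROMPT_TEMPLATE = (
    "Rate the quality of the following translation on a scale from 0 to 100.\n"
    "Source: {source}\n"
    "Translation: {hypothesis}\n"
    "Score:"
)


def build_prompt(source: str, hypothesis: str) -> str:
    """Fill the template with one source/hypothesis pair."""
    return PROMPT_TEMPLATE.format(source=source, hypothesis=hypothesis)


# Matches the first integer or decimal number in a string.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_score(
    generation: str, low: float = 0.0, high: float = 100.0
) -> Optional[float]:
    """Return the first number in the model output, clamped to [low, high].

    Returns None when the output contains no number, so the caller can
    decide how to handle unparseable generations.
    """
    match = _NUMBER.search(generation)
    if match is None:
        return None
    return max(low, min(high, float(match.group())))
```

In use, `build_prompt` would be passed to whichever allowed LLM is being evaluated, and `extract_score` would be applied to the generated continuation. Taking the first number is a deliberately simple heuristic; outputs that mention other numbers before the score would need a stricter pattern.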
