CL · Oct 30, 2023

The Eval4NLP 2023 Shared Task on Prompting Large Language Models as Explainable Metrics

arXiv:2310.19792v1 · 115 citations · h-index: 16
Originality: Synthesis-oriented
AI Analysis

This paper addresses the problem of developing efficient evaluation metrics for text generation tasks, but its contribution is incremental: it builds on existing prompting methods within a restricted competition setting.

The paper introduced the Eval4NLP 2023 shared task to explore prompting and score extraction in large language models for evaluating machine translation and summarization, with the best systems achieving results on par with or surpassing recent reference-free metrics like GEMBA and Comet-Kiwi-XXL.

With an increasing number of parameters and pre-training data, generative large language models (LLMs) have shown remarkable capabilities to solve tasks with minimal or no task-related examples. Notably, LLMs have been successfully employed as evaluation metrics in text generation tasks. Within this context, we introduce the Eval4NLP 2023 shared task that asks participants to explore prompting and score extraction for machine translation (MT) and summarization evaluation. Specifically, we propose a novel competition setting in which we select a list of allowed LLMs and disallow fine-tuning to ensure a focus on prompting. We present an overview of participants' approaches and evaluate them on a new reference-free test set spanning three language pairs for MT and a summarization dataset. Notably, despite the task's restrictions, the best-performing systems achieve results on par with or even surpassing recent reference-free metrics developed using larger models, including GEMBA and Comet-Kiwi-XXL. Finally, as a separate track, we perform a small-scale human evaluation of the plausibility of explanations given by the LLMs.
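
To make "prompting and score extraction" concrete, the following is a minimal Python sketch of what a reference-free setup might look like: a prompt is built from the source and the hypothesis only, and a numeric score is parsed from the model's free-text reply. The prompt wording, the 0 to 100 scale, and the names `PROMPT_TEMPLATE`, `build_prompt`, and `extract_score` are illustrative assumptions, not the shared task's or any participant's actual method. The call to the LLM itself is deliberately left out.

```python
import re
from typing import Optional

# Hypothetical prompt template: the wording and the 0-100 scale are
# illustrative assumptions, not taken from the shared task.
PROMPT_TEMPLATE = (
    "Score the following translation from 0 (worst) to 100 (best).\n"
    "Source: {source}\n"
    "Translation: {hypothesis}\n"
    "Score:"
)

# Matches the first integer or decimal number in a string.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def build_prompt(source: str, hypothesis: str) -> str:
    """Fill the template; reference-free, so no reference translation is used."""
    return PROMPT_TEMPLATE.format(source=source, hypothesis=hypothesis)


def extract_score(
    response: str, low: float = 0.0, high: float = 100.0
) -> Optional[float]:
    """Return the first number in the model's reply, clamped to [low, high].

    Returns None when the reply contains no number, so the caller can
    decide how to handle unparseable outputs.
    """
    match = _NUMBER.search(response)
    if match is None:
        return None
    value = float(match.group())
    return min(max(value, low), high)
```

Taking the first number in the reply is only one possible extraction rule. It is simple, but it would misread a reply that restates the scale before giving the score, which is one reason score extraction is worth exploring in its own right.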

Code Implementations: 1 repo