The Eval4NLP 2023 Shared Task on Prompting Large Language Models as Explainable Metrics
This work addresses the problem of developing efficient evaluation metrics for text generation tasks, but its contribution is incremental: it builds on existing prompting methods within a restricted competition setting.
The paper introduces the Eval4NLP 2023 shared task, which explores prompting and score extraction with large language models for evaluating machine translation and summarization; the best systems achieve results on par with or surpassing recent reference-free metrics such as GEMBA and Comet-Kiwi-XXL.
With an increasing number of parameters and pre-training data, generative large language models (LLMs) have shown remarkable capabilities to solve tasks with minimal or no task-related examples. Notably, LLMs have been successfully employed as evaluation metrics in text generation tasks. Within this context, we introduce the Eval4NLP 2023 shared task that asks participants to explore prompting and score extraction for machine translation (MT) and summarization evaluation. Specifically, we propose a novel competition setting in which we select a list of allowed LLMs and disallow fine-tuning to ensure a focus on prompting. We present an overview of participants' approaches and evaluate them on a new reference-free test set spanning three language pairs for MT and a summarization dataset. Notably, despite the task's restrictions, the best-performing systems achieve results on par with or even surpassing recent reference-free metrics developed using larger models, including GEMBA and Comet-Kiwi-XXL. Finally, as a separate track, we perform a small-scale human evaluation of the plausibility of explanations given by the LLMs.
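To make "prompting and score extraction" concrete, the following is a minimal sketch of what a reference-free MT evaluation prompt and a score parser might look like. It is not taken from the shared task or from any participant's system: the template wording, the 0 to 100 scale, and the names `PROMPT_TEMPLATE`, `build_prompt`, and `extract_score` are all illustrative assumptions, and the call to the LLM itself is omitted.

```python
import re
from typing import Optional

# Hypothetical prompt template; the wording and the 0-100 scale are assumptions.
# It is reference-free: only the source and the translation are shown.
PROMPT_TEMPLATE = (
    "Score the following translation from {src_lang} to {tgt_lang} "
    "on a scale from 0 to 100, where 100 is a perfect translation.\n"
    "Source: {source}\n"
    "Translation: {hypothesis}\n"
    "Score:"
)


def build_prompt(source: str, hypothesis: str, src_lang: str, tgt_lang: str) -> str:
    """Fill the template with one source/translation pair."""
    return PROMPT_TEMPLATE.format(
        src_lang=src_lang, tgt_lang=tgt_lang, source=source, hypothesis=hypothesis
    )


def extract_score(generated: str, low: float = 0.0, high: float = 100.0) -> Optional[float]:
    """Return the first number in the model output, clamped to [low, high].

    Returns None when the output contains no number at all.
    """
    match = re.search(r"-?\d+(?:\.\d+)?", generated)
    if match is None:
        return None
    value = float(match.group())
    return min(max(value, low), high)
```

In such a setup, the string returned by `build_prompt` would be sent to one of the allowed LLMs without any fine-tuning, and `extract_score` would turn the free-text completion into a numeric metric value. Taking the first number is only one possible extraction rule; outputs that yield `None` need a fallback chosen by the system designer.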