CL, Sep 14, 2023

Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation?

Rishav Hada, Varun Gumma, Adrian de Wynter, Harshita Diddee, Mohamed Ahmed, Monojit Choudhury, Kalika Bali, Sunayana Sitaram

Microsoft

arXiv:2309.07462v2 24.6141 citations, h-index: 39

Originality: Synthesis-oriented

AI Analysis

This work addresses the challenge of scaling multilingual evaluation for NLP researchers, but it is incremental: it focuses on calibrating existing methods rather than introducing a new paradigm.

The study addresses the inadequate evaluation of large language models in languages beyond the top 20 by exploring GPT-4 as an evaluator. Comparing its scores against 20K human judgments, the authors find that GPT-4 is biased towards higher scores and needs calibration against native speaker judgments, especially in low-resource and non-Latin script languages.

Large Language Models (LLMs) excel in various Natural Language Processing (NLP) tasks, yet their evaluation, particularly in languages beyond the top $20$, remains inadequate due to the limitations of existing benchmarks and metrics. Employing LLMs as evaluators to rank or score other models' outputs emerges as a viable solution, addressing the constraints tied to human annotators and established benchmarks. In this study, we explore the potential of LLM-based evaluators, specifically GPT-4, in enhancing multilingual evaluation by calibrating them against $20$K human judgments across three text-generation tasks, five metrics, and eight languages. Our analysis reveals a bias in GPT-4-based evaluators towards higher scores, underscoring the necessity of calibration with native speaker judgments, especially in low-resource and non-Latin script languages, to ensure accurate evaluation of LLM performance across diverse languages.
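The comparison the abstract describes, LLM-assigned scores checked against human judgments per language, can be illustrated with a small sketch. This is not the authors' pipeline: the function name `score_bias_by_language`, the 0-2 score scale, the language labels `lang_a` and `lang_b`, and the toy records are all hypothetical, chosen only to show how a bias towards higher scores might surface as a positive mean difference.

```python
from collections import defaultdict
from statistics import mean


def score_bias_by_language(records):
    """Summarize LLM-vs-human scoring per language.

    records: iterable of (language, human_score, llm_score) tuples.
    Returns, per language, the sample count, the mean signed difference
    (llm - human; positive means the LLM scores higher), and the
    fraction of items where both scores match exactly.
    """
    grouped = defaultdict(list)
    for language, human, llm in records:
        grouped[language].append((human, llm))

    summary = {}
    for language, pairs in grouped.items():
        summary[language] = {
            "n": len(pairs),
            "mean_bias": mean(llm - human for human, llm in pairs),
            "agreement": mean(1.0 if llm == human else 0.0 for human, llm in pairs),
        }
    return summary


# Hypothetical toy data on an assumed 0-2 scale; not taken from the paper.
toy_records = [
    ("lang_a", 2, 2), ("lang_a", 1, 2), ("lang_a", 2, 2), ("lang_a", 0, 0),
    ("lang_b", 1, 2), ("lang_b", 0, 2), ("lang_b", 2, 2), ("lang_b", 1, 1),
]

result = score_bias_by_language(toy_records)
for language, stats in sorted(result.items()):
    print(language, stats)
```

In this toy data, `lang_b` shows both a larger positive mean difference and lower exact agreement than `lang_a`, which is the kind of pattern that would suggest the evaluator needs calibration for that language.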
