CL | May 28, 2025

MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Dialogue Evaluators

John Mendonça, Alon Lavie, Isabel Trancoso

arXiv:2505.22777v44.91 citations | h-index: 48 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for better evaluation benchmarks for multilingual open-domain dialogue systems, though the advance is incremental because it builds on existing LLM-based evaluation methods.

The authors tackle the problem of evaluating open-domain chatbots by introducing MEDAL, a framework for curating multilingual dialogue evaluation benchmarks. The resulting benchmark shows that current LLM judges fail to reliably detect nuanced issues such as lack of empathy or commonsense.

Evaluating the quality of open-domain chatbots has become increasingly reliant on LLMs acting as automatic judges. However, existing meta-evaluation benchmarks are static, outdated, and lacking in multilingual coverage, limiting their ability to fully capture subtle weaknesses in evaluation. We introduce MEDAL, an automated multi-agent framework for curating more representative and diverse open-domain dialogue evaluation benchmarks. Our approach leverages several state-of-the-art LLMs to generate user-chatbot multilingual dialogues, conditioned on varied seed contexts. Then, a strong LLM (GPT-4.1) is used for a multidimensional analysis of the performance of the chatbots, uncovering noticeable cross-lingual performance differences. Guided by this large-scale evaluation, we curate a new meta-evaluation multilingual benchmark and human-annotate samples with nuanced quality judgments. This benchmark is then used to assess the ability of several reasoning and non-reasoning LLMs to act as evaluators of open-domain dialogues. Using MEDAL, we uncover that state-of-the-art judges fail to reliably detect nuanced issues such as lack of empathy, commonsense, or relevance.
