CLMay 28, 2025

MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Dialogue Evaluators

arXiv:2505.22777v41 citationsh-index: 48Has Code
Originality Incremental advance
AI Analysis

This addresses the need for better evaluation benchmarks in multilingual open-domain dialogue systems, though it is incremental as it builds on existing LLM-based evaluation methods.

The authors tackled the problem of evaluating open-domain chatbots by introducing MEDAL, a framework for curating multilingual dialogue evaluation benchmarks, which revealed that current LLM judges fail to reliably detect nuanced issues like lack of empathy or commonsense.

Evaluating the quality of open-domain chatbots has become increasingly reliant on LLMs acting as automatic judges. However, existing meta-evaluation benchmarks are static, outdated, and lacking in multilingual coverage, limiting their ability to fully capture subtle weaknesses in evaluation. We introduce MEDAL, an automated multi-agent framework for curating more representative and diverse open-domain dialogue evaluation benchmarks. Our approach leverages several state-of-the-art LLMs to generate user-chatbot multilingual dialogues, conditioned on varied seed contexts. Then, a strong LLM (GPT-4.1) is used for a multidimensional analysis of the performance of the chatbots, uncovering noticeable cross-lingual performance differences. Guided by this large-scale evaluation, we curate a new meta-evaluation multilingual benchmark and human-annotate samples with nuanced quality judgments. This benchmark is then used to assess the ability of several reasoning and non-reasoning LLMs to act as evaluators of open-domain dialogues. Using MEDAL, we uncover that state-of-the-art judges fail to reliably detect nuanced issues such as lack of empathy, commonsense, or relevance.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes