Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study
This study provides practical guidance for building reliable multilingual evaluation pipelines and addresses the challenge of extending LLM-based evaluators to low-resource languages.
This work explores strategies for developing multilingual LLMs-as-a-judge across English, Spanish, and Basque. It finds that fine-tuned smaller models can match proprietary models when in-domain data is available, whereas zero-shot evaluation with larger models is more effective in out-of-domain settings, and that fine-tuning on out-of-domain data can hurt performance.
Large language models (LLMs) are increasingly used for the automatic evaluation of generated text, yet most prior work focuses on English. Despite the growing demand for multilingual evaluation, extending LLM-based evaluators to multilingual settings remains challenging, particularly for low-resource languages and scenarios where in-domain data is scarce. This work explores several strategies for developing multilingual LLMs-as-a-judge, depending on whether in-domain data is available for fine-tuning. We systematically analyze English, Spanish, and Basque, which represent high-, mid-, and low-resource languages respectively, and examine the effects of instruction translation, monolingual versus multilingual supervision, and model size. For evaluation, we extend two existing meta-evaluation datasets to Basque and Spanish. Our results reveal key trade-offs: when in-domain data is available, fine-tuned smaller models can achieve performance comparable to proprietary models, whereas zero-shot evaluation with larger models proves more effective in out-of-domain settings. We also observe that fine-tuning on out-of-domain data can adversely affect model performance. These findings provide practical guidance for building efficient, reliable multilingual evaluation pipelines. The data and code are publicly available at hitz-zentroa/mJudge.