CL · Feb 21, 2024

OMGEval: An Open Multilingual Generative Evaluation Benchmark for Large Language Models

arXiv:2402.13524v1 · 3 citations · h-index: 21 · Has Code
Originality: Synthesis-oriented
AI Analysis

This paper addresses the problem of assessing the multilingual capabilities of LLMs for researchers and developers, but the contribution is incremental, as it extends existing evaluation frameworks to new languages.

The authors tackled the lack of multilingual evaluation benchmarks for large language models by introducing OMGEval, an open-source test set with 804 questions per language across five languages, which they used to evaluate several models and provide a reference for the community.

Modern large language models (LLMs) should generally benefit individuals from various cultural backgrounds around the world. However, most recent advanced generative evaluation benchmarks tailored for LLMs mainly focus on English. To this end, we introduce OMGEval, the first Open-source Multilingual Generative test set that can assess the capability of LLMs in different languages. For each language, OMGEval provides 804 open-ended questions, covering a wide range of important capabilities of LLMs, such as general knowledge and logical reasoning. Each question is rigorously verified by human annotators. Notably, to sufficiently reflect the compatibility of LLMs with different cultural backgrounds, we perform localization for each non-English language. Specifically, the current version of OMGEval includes 5 languages (i.e., Zh, Ru, Fr, Es, Ar). Following AlpacaEval, we employ GPT-4 as the adjudicator to automatically score different model outputs, an approach that has been shown to be closely related to human evaluation. We evaluate several representative multilingual LLMs on the proposed OMGEval, which we believe will provide a valuable reference for the community to further understand and improve the multilingual capability of LLMs. OMGEval is available at https://github.com/blcuicall/OMGEval.
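
The abstract says that GPT-4 adjudicates model outputs following AlpacaEval, but it does not spell out the scoring formula. As a minimal sketch only, assuming an AlpacaEval-style pairwise setup in which a judge prefers either the evaluated model or a baseline on each question, a per-language win rate could be aggregated as below. The verdict labels, the `win_rates` helper, the half-credit treatment of ties, and the sample data are all hypothetical and are not taken from the OMGEval repository.

```python
from collections import defaultdict

# Hypothetical verdict labels for a pairwise judge (assumption, not OMGEval's format).
VALID_VERDICTS = {"model", "baseline", "tie"}


def win_rates(verdicts):
    """Aggregate pairwise judge verdicts into a win rate per language.

    `verdicts` is an iterable of (language, verdict) pairs. A tie is
    counted as half a win, which is an assumption of this sketch.
    """
    wins = defaultdict(float)
    totals = defaultdict(int)
    for lang, verdict in verdicts:
        if verdict not in VALID_VERDICTS:
            raise ValueError(f"unknown verdict: {verdict!r}")
        totals[lang] += 1
        if verdict == "model":
            wins[lang] += 1.0
        elif verdict == "tie":
            wins[lang] += 0.5
    # Win rate = credited wins divided by the number of judged questions.
    return {lang: wins[lang] / totals[lang] for lang in totals}


# Illustrative data only; these are not results from the paper.
sample = [
    ("zh", "model"),
    ("zh", "baseline"),
    ("zh", "tie"),
    ("fr", "model"),
]
rates = win_rates(sample)
for lang, rate in sorted(rates.items()):
    print(f"{lang}: {rate:.2%}")
```

In a full run under these assumptions, each language would contribute 804 verdicts, one per question, so the per-language rates would be directly comparable across the five languages.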

Code Implementations: 1 repo