CL · AI · Nov 14, 2024

MM-Eval: A Hierarchical Benchmark for Modern Mongolian Evaluation in LLMs

arXiv:2411.09492v1 · 2 citations · h-index: 1 · Has Code
Originality: Synthesis-oriented
AI Analysis

MM-Eval provides a specialized evaluation tool for advancing NLP in low-resource languages such as Mongolian, though the contribution is incremental because it adapts existing methods to a new context.

The paper tackles the challenge of evaluating large language models (LLMs) in low-resource languages such as Mongolian by developing MM-Eval, a hierarchical benchmark dataset. It finds that models perform better on syntactic tasks than on semantic ones, and that performance on knowledge tasks declines moderately.

Large language models (LLMs) excel in high-resource languages but face notable challenges in low-resource languages like Mongolian. This paper addresses these challenges by categorizing capabilities into language abilities (syntax and semantics) and cognitive abilities (knowledge and reasoning). To systematically evaluate these areas, we developed MM-Eval, a specialized dataset based on Modern Mongolian Language Textbook I and enriched with WebQSP and MGSM datasets. Preliminary experiments on models including Qwen2-7B-Instruct, GLM4-9b-chat, Llama3.1-8B-Instruct, GPT-4, and DeepseekV2.5 revealed that: 1) all models performed better on syntactic tasks than semantic tasks, highlighting a gap in deeper language understanding; and 2) knowledge tasks showed a moderate decline, suggesting that models can transfer general knowledge from high-resource to low-resource contexts. The release of MM-Eval, comprising 569 syntax, 677 semantics, 344 knowledge, and 250 reasoning tasks, offers valuable insights for advancing NLP and LLMs in low-resource languages like Mongolian. The dataset is available at https://github.com/joenahm/MM-Eval.
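The two-level hierarchy described in the abstract can be sketched as a small data structure. This is a minimal illustration that uses only the task counts stated above; the dictionary layout, the function names, and the `(category, is_correct)` result format are assumptions made for this example and do not reflect the actual file format or scoring code in the MM-Eval repository.

```python
from collections import defaultdict

# Task counts as stated in the abstract, grouped by the paper's two
# ability groups. The nesting itself is an illustrative assumption.
MM_EVAL_TASKS = {
    "language": {"syntax": 569, "semantics": 677},
    "cognitive": {"knowledge": 344, "reasoning": 250},
}


def group_totals(tasks):
    """Sum the task counts of each top-level ability group."""
    return {group: sum(cats.values()) for group, cats in tasks.items()}


def accuracy_by_category(results):
    """Compute per-category accuracy.

    `results` is an iterable of (category, is_correct) pairs, a
    hypothetical format chosen for this sketch.
    """
    correct = defaultdict(int)
    seen = defaultdict(int)
    for category, is_correct in results:
        seen[category] += 1
        correct[category] += int(bool(is_correct))
    return {category: correct[category] / seen[category] for category in seen}


totals = group_totals(MM_EVAL_TASKS)
overall = sum(totals.values())
```

With the stated counts, the language group holds 1,246 tasks and the cognitive group 594, for 1,840 tasks overall. Reporting accuracy per category, rather than as one pooled number, is what allows a syntax-versus-semantics comparison like the one the abstract describes.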

Code Implementations: 1 repo
