MATA (māta): Mindful Assessment of the Telugu Abilities of Large Language Models
This work addresses the need for fine-grained evaluation of LLMs in low-resource languages such as Telugu. The contribution is incremental in that it builds on existing multilingual NLP research.
The authors evaluate the Telugu abilities of large language models by introducing MATA, a dataset of 729 multiple-choice and open-ended questions. Across 11 models, they find that performance varies and that models rely on superficial heuristics such as answer position and distractor patterns on multiple-choice questions. They also find that LLM-as-a-judge evaluation shows reliability issues when compared with human assessment of open-ended answers.
In this paper, we introduce MATA, a novel evaluation dataset for assessing the abilities of Large Language Models (LLMs) in the Telugu language. It comprises 729 carefully curated multiple-choice and open-ended questions that span diverse linguistic dimensions. We evaluate 11 open-weight and closed-source LLMs on our dataset and present a fine-grained analysis of their performance. Further, we empirically show that LLMs rely on superficial heuristics such as answer position and distractor patterns when answering multiple-choice questions. Finally, we compare LLM-as-a-judge evaluation with human evaluation for open-ended questions and draw conclusions about its reliability in a low-resource language. We argue that such fine-grained evaluation is essential for understanding model limitations, can inform the development of more linguistically capable LLMs, and can serve as a foundation for future research in Telugu NLP.
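To illustrate what a check for reliance on answer position might look like, the following is a minimal sketch and not the protocol used in this work. It assumes a hypothetical model interface (a callable that receives a question and an ordered list of options and returns the index it selects) and toy data. The idea is to move the correct answer to each position in turn and compare accuracy per position; a model that reads the options carefully should score about the same regardless of where the correct answer sits.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical interface (an assumption for this sketch): the model receives a
# question and an ordered list of options, and returns the index it selects.
Model = Callable[[str, Sequence[str]], int]
Item = Tuple[str, Sequence[str], int]  # (question, options, index of correct option)


def place_answer(options: Sequence[str], answer_idx: int, target_pos: int) -> List[str]:
    """Reorder options so the correct answer sits at target_pos.

    The distractors keep their original relative order.
    """
    distractors = [o for i, o in enumerate(options) if i != answer_idx]
    return distractors[:target_pos] + [options[answer_idx]] + distractors[target_pos:]


def accuracy_by_position(model: Model, items: Sequence[Item]) -> Dict[int, float]:
    """Accuracy when the correct answer is forced into each position.

    Assumes every item has the same number of options.
    """
    n_options = len(items[0][1])
    result: Dict[int, float] = {}
    for pos in range(n_options):
        correct = 0
        for question, options, answer_idx in items:
            reordered = place_answer(options, answer_idx, pos)
            if model(question, reordered) == pos:
                correct += 1
        result[pos] = correct / len(items)
    return result


def always_first(question: str, options: Sequence[str]) -> int:
    """Toy stand-in for a model with an extreme position bias."""
    return 0


# Toy items, invented for illustration only.
toy_items: List[Item] = [
    ("Q1", ["a", "b", "c", "d"], 2),
    ("Q2", ["w", "x", "y", "z"], 0),
]

by_pos = accuracy_by_position(always_first, toy_items)
print(by_pos)
```

With the toy `always_first` model, accuracy is perfect when the answer is in the first position and zero elsewhere, which is the signature of a position heuristic. A large gap between positions for a real model would suggest the same kind of reliance, although it would not by itself explain the cause.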