CL | AI | Feb 10, 2025

SeaExam and SeaBench: Benchmarking LLMs with Local Multilingual Questions in Southeast Asia

arXiv:2502.06298v1 | 18 citations | h-index: 32 | NAACL
Originality: Highly original
AI Analysis

This work addresses the problem of accurately evaluating LLMs for users in Southeast Asia, particularly those who need models that handle local languages and contexts. It is an incremental yet significant step for the region.

This study tackles the evaluation of Large Language Models (LLMs) in Southeast Asian (SEA) application scenarios and introduces two novel benchmarks, SeaExam and SeaBench, which discern LLM performance on SEA language tasks more effectively than translated benchmarks do. The results underscore the importance of using real-world queries to assess the multilingual capabilities of LLMs.

This study introduces two novel benchmarks, SeaExam and SeaBench, designed to evaluate the capabilities of Large Language Models (LLMs) in Southeast Asian (SEA) application scenarios. Unlike existing multilingual datasets primarily derived from English translations, these benchmarks are constructed based on real-world scenarios from SEA regions. SeaExam draws from regional educational exams to form a comprehensive dataset that encompasses subjects such as local history and literature. In contrast, SeaBench is crafted around multi-turn, open-ended tasks that reflect daily interactions within SEA communities. Our evaluations demonstrate that SeaExam and SeaBench more effectively discern LLM performance on SEA language tasks compared to their translated benchmarks. This highlights the importance of using real-world queries to assess the multilingual capabilities of LLMs.

