CLMar 29, 2024

Measuring Taiwanese Mandarin Language Understanding

arXiv:2403.20180v12 citationsh-index: 9
Originality Synthesis-oriented
AI Analysis

This work addresses the need for localized evaluation benchmarks for Taiwanese Mandarin LLMs, though it is incremental as it adapts existing evaluation methods to a specific linguistic context.

The authors tackled the underrepresentation of Traditional Chinese in LLM evaluation by creating TMLU, a benchmark for Taiwanese Mandarin, and found that Chinese open-weight models perform worse than multilingual proprietary ones, with Taiwanese Mandarin models lagging behind Simplified-Chinese counterparts.

The evaluation of large language models (LLMs) has drawn substantial attention in the field recently. This work focuses on evaluating LLMs in a Chinese context, specifically, for Traditional Chinese which has been largely underrepresented in existing benchmarks. We present TMLU, a holistic evaluation suit tailored for assessing the advanced knowledge and reasoning capability in LLMs, under the context of Taiwanese Mandarin. TMLU consists of an array of 37 subjects across social science, STEM, humanities, Taiwan-specific content, and others, ranging from middle school to professional levels. In addition, we curate chain-of-thought-like few-shot explanations for each subject to facilitate the evaluation of complex reasoning skills. To establish a comprehensive baseline, we conduct extensive experiments and analysis on 24 advanced LLMs. The results suggest that Chinese open-weight models demonstrate inferior performance comparing to multilingual proprietary ones, and open-weight models tailored for Taiwanese Mandarin lag behind the Simplified-Chinese counterparts. The findings indicate great headrooms for improvement, and emphasize the goal of TMLU to foster the development of localized Taiwanese-Mandarin LLMs. We release the benchmark and evaluation scripts for the community to promote future research.

Code Implementations2 repos
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes