CLFeb 2

Evaluating Metalinguistic Knowledge in Large Language Models across the World's Languages

Tjaša Arčon, Matej Klemen, Marko Robnik-Šikonja, Kaja Dobrovoljc

arXiv:2602.02182v10.6h-index: 17Has Code

Originality Incremental advance

AI Analysis

This reveals a critical gap in LLMs' ability to reason about language structure rather than just use language, particularly for low-resource languages where performance is substantially lower.

The researchers evaluated how well large language models understand linguistic structure across diverse languages, finding that current models have limited metalinguistic knowledge with GPT-4o achieving only moderate accuracy (0.367) and all models failing to outperform simple baselines.

Large language models (LLMs) are routinely evaluated on language use tasks, yet their knowledge of linguistic structure remains poorly understood. Existing linguistic benchmarks typically focus on narrow phenomena, emphasize high-resource languages, and rarely evaluate metalinguistic knowledge-explicit reasoning about language structure rather than language use. Using accuracy and macro F1, together with majority-class and chance baselines, we analyse overall performance and examine variation by linguistic domains and language-related factors. Our results show that metalinguistic knowledge in current LLMs is limited: GPT-4o performs best but achieves only moderate accuracy (0.367), while open-source models lag behind. All models perform above chance but fail to outperform the majority-class baseline, suggesting they capture cross-linguistic patterns but lack fine-grained grammatical distinctions. Performance varies across linguistic domains, with lexical features showing the highest accuracy and phonological features among the lowest, partially reflecting differences in online visibility. At the language level, accuracy shows a strong association with digital language status: languages with higher digital presence and resource availability are evaluated more accurately, while low-resource languages show substantially lower performance. Analyses of predictive factors confirm that resource-related indicators (Wikipedia size, corpus availability) are more informative predictors of accuracy than geographical, genealogical, or sociolinguistic factors. Together, these results suggest that LLMs' metalinguistic knowledge is fragmented and shaped by data availability rather than generalizable grammatical competence across the world's languages. We release our benchmark as an open-source dataset to support systematic evaluation and encourage greater global linguistic diversity in future LLMs.

View on arXiv PDF

Similar