CL · May 30, 2025

Disentangling Language and Culture for Evaluating Multilingual Large Language Models

arXiv:2505.24635v1 · 14 citations · h-index: 5 · Has Code · ACL
Originality: Incremental advance
AI Analysis

This work addresses the need for culturally and linguistically nuanced evaluations of LLMs, challenging the assumption of uniform performance across languages, though it is incremental in proposing a new evaluation framework rather than a model improvement.

The paper tackles the problem of evaluating multilingual large language models by introducing a Dual Evaluation Framework that separates linguistic and cultural dimensions. It reveals a "Cultural-Linguistic Synergy", where models perform better when questions are culturally aligned with the language, and suggests that the proportion of specific neurons activated could serve as a potential indicator of this effect.

This paper introduces a Dual Evaluation Framework to comprehensively assess the multilingual capabilities of LLMs. By decomposing the evaluation along the dimensions of linguistic medium and cultural context, this framework enables a nuanced analysis of LLMs' ability to process questions within both native and cross-cultural contexts cross-lingually. Extensive evaluations are conducted on a wide range of models, revealing a notable "Cultural-Linguistic Synergy" phenomenon, where models exhibit better performance when questions are culturally aligned with the language. This phenomenon is further explored through interpretability probing, which shows that a higher proportion of specific neurons is activated in a language's cultural context. This activation proportion could serve as a potential indicator for evaluating multilingual performance during model training. Our findings challenge the prevailing notion that LLMs, primarily trained on English data, perform uniformly across languages and highlight the necessity of culturally and linguistically informed model evaluations. Our code can be found at https://yingjiahao14.github.io/Dual-Evaluation/.
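
To make the two-axis decomposition concrete, here is a minimal sketch of how per-question results might be tabulated by (language, cultural context) and how a "synergy gap" could be read off. It is not the authors' implementation: the function names `accuracy_grid` and `synergy_gap`, the record layout, the language-to-culture mapping, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict


def accuracy_grid(records):
    """Average correctness per (language, culture) cell.

    `records` is an iterable of (language, culture, is_correct) tuples;
    this layout is a hypothetical one chosen for the sketch.
    """
    totals = defaultdict(lambda: [0, 0])  # cell -> [num_correct, num_total]
    for language, culture, is_correct in records:
        cell = totals[(language, culture)]
        cell[0] += int(is_correct)
        cell[1] += 1
    return {key: correct / n for key, (correct, n) in totals.items()}


def synergy_gap(grid, native_culture):
    """Aligned accuracy minus mean non-aligned accuracy, per language.

    `native_culture` maps each language to the culture assumed to be
    aligned with it (an assumption supplied by the caller).
    """
    gaps = {}
    for language, culture in native_culture.items():
        aligned = grid.get((language, culture))
        others = [
            acc
            for (lang, cult), acc in grid.items()
            if lang == language and cult != culture
        ]
        if aligned is None or not others:
            continue  # not enough cells to compare for this language
        gaps[language] = aligned - sum(others) / len(others)
    return gaps


# Toy, made-up results purely to exercise the functions.
records = [
    ("zh", "Chinese", True), ("zh", "Chinese", True),
    ("zh", "American", True), ("zh", "American", False),
    ("en", "American", True), ("en", "American", True),
    ("en", "Chinese", False), ("en", "Chinese", True),
]
grid = accuracy_grid(records)
gaps = synergy_gap(grid, {"zh": "Chinese", "en": "American"})
```

Under this reading, a positive gap for a language would be consistent with the synergy described in the abstract: questions rooted in that language's own cultural context are answered more accurately than questions from other cultural contexts posed in the same language.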
