CL · Feb 1

Beyond Training for Cultural Awareness: The Role of Dataset Linguistic Structure in Large Language Models

Reem I. Masoud, Chen Feng, Shunta Asano, Saied Alshahrani, Philip Colin Treleaven, Miguel R. D. Rodrigues

arXiv:2602.01161v1

Originality: Incremental advance

AI Analysis

This research addresses cultural misalignment in globally deployed LLMs by identifying properties of fine-tuning datasets that are associated with cultural performance. The advance is incremental because it builds on existing adaptation methods.

The study investigated how linguistic properties of fine-tuning datasets affect cultural alignment in large language models. It found that lexical-oriented components yielded the most consistent performance across models and benchmarks, while emphasizing semantic or diversity extremes was often neutral or harmful.

The global deployment of large language models (LLMs) has raised concerns about cultural misalignment, yet the linguistic properties of fine-tuning datasets used for cultural adaptation remain poorly understood. We adopt a dataset-centric view of cultural alignment and ask which linguistic properties of fine-tuning data are associated with cultural performance, whether these properties are predictive prior to training, and how these effects vary across models. We compute lightweight linguistic, semantic, and structural metrics for Arabic, Chinese, and Japanese datasets and apply principal component analysis separately within each language. This design ensures that the resulting components capture variation among datasets written in the same language rather than differences between languages. The resulting components correspond to broadly interpretable axes related to semantic coherence, surface-level lexical and syntactic diversity, and lexical or structural richness, though their composition varies across languages. We fine-tune three major LLM families (LLaMA, Mistral, DeepSeek) and evaluate them on benchmarks of cultural knowledge, values, and norms. While PCA components correlate with downstream performance, these associations are strongly model-dependent. Through controlled subset interventions, we show that lexical-oriented components (PC3) are the most robust, yielding more consistent performance across models and benchmarks, whereas emphasizing semantic or diversity extremes (PC1-PC2) is often neutral or harmful.
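The per-language PCA step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function name `pca_within_language`, the number of metrics, and the randomly generated placeholder matrices are all assumptions made for the example. It uses only NumPy and computes PCA via SVD on standardized metrics, fitting one decomposition per language so that components reflect variation among datasets in the same language.

```python
import numpy as np

def pca_within_language(metrics, n_components=3):
    """Standardize a (datasets x metrics) matrix and project it onto its
    top principal components. Hypothetical helper, not from the paper."""
    X = np.asarray(metrics, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0                      # avoid dividing by zero for constant metrics
    Z = (X - X.mean(axis=0)) / std           # z-score each metric within this language
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]      # dataset coordinates on each PC
    explained = S[:n_components] ** 2 / np.sum(S ** 2)   # share of variance per PC
    return scores, Vt[:n_components], explained

# Placeholder data (assumption): 12 datasets x 5 metrics per language.
rng = np.random.default_rng(0)
metrics_by_language = {
    lang: rng.normal(size=(12, 5)) for lang in ("arabic", "chinese", "japanese")
}

# One PCA per language, never a pooled PCA across languages.
results = {lang: pca_within_language(X) for lang, X in metrics_by_language.items()}

for lang, (scores, loadings, explained) in results.items():
    print(lang, scores.shape, np.round(explained, 3))
```

Fitting separately within each language keeps between-language differences (script, tokenization, typical document length) from dominating the leading components. In a real analysis, the loadings in each row of `loadings` would be inspected to decide whether a component reads as semantic, diversity-related, or lexical.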
