CL · AI · Apr 9

Lessons Without Borders? Evaluating Cultural Alignment of LLMs Using Multilingual Story Moral Generation

arXiv:2604.08797 · 31.81 citations · h-index: 1

Predicted impact: top 17% in CL · last 90 days · Originality: Incremental advance

AI Analysis

The paper provides a new evaluation method for assessing cultural alignment in LLMs, and it finds that current models fail to capture the diversity of human moral interpretation across cultures.

This work introduces multilingual story moral generation as a culturally grounded evaluation task. It finds that frontier LLMs such as GPT-4o and Gemini produce morals that are semantically similar to human responses and preferred by human evaluators, but that show less cross-linguistic variation and value diversity than human-written morals.

Stories are key to transmitting values across cultures, but their interpretation varies across linguistic and cultural contexts. Thus, we introduce multilingual story moral generation as a novel culturally grounded evaluation task. Using a new dataset of human-written story morals collected across 14 language-culture pairs, we compare model outputs with human interpretations via semantic similarity, a human preference survey, and value categorization. We show that frontier models such as GPT-4o and Gemini generate story morals that are semantically similar to human responses and preferred by human evaluators. However, their outputs exhibit markedly less cross-linguistic variation and concentrate on a narrower set of widely shared values. These findings suggest that while contemporary models can approximate central tendencies of human moral interpretation, they struggle to reproduce the diversity that characterizes human narrative understanding. By framing narrative interpretation as an evaluative task, this work introduces a new approach to studying cultural alignment in language models beyond static benchmarks or knowledge-based tests.
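
The contrast the abstract draws, between closeness to human responses and spread across languages, can be illustrated with a minimal sketch. It is not the authors' pipeline: the embedding vectors are hand-written toy values, the language keys (`lang_a`, `lang_b`, `lang_c`) are placeholders, and the two measures (mean cosine similarity to the human moral, and mean pairwise cosine distance across languages) are assumed stand-ins for whatever the paper actually uses.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_distance(vectors):
    """Average cosine distance (1 - similarity) over all unordered pairs."""
    pairs = list(combinations(vectors, 2))
    return sum(1.0 - cosine_similarity(u, v) for u, v in pairs) / len(pairs)


# Hypothetical toy embeddings of one story's moral, one vector per language.
# Real embeddings would come from a sentence encoder; these are illustrative.
human = {
    "lang_a": [1.0, 0.0, 0.0],
    "lang_b": [0.0, 1.0, 0.0],
    "lang_c": [0.0, 0.0, 1.0],
}
model = {
    "lang_a": [1.0, 0.1, 0.0],
    "lang_b": [1.0, 0.0, 0.1],
    "lang_c": [1.0, 0.1, 0.1],
}

# Spread of interpretations across languages, for humans and for the model.
human_variation = mean_pairwise_distance(list(human.values()))
model_variation = mean_pairwise_distance(list(model.values()))

# Average similarity between the model's moral and the human moral
# written in the same language.
alignment = sum(cosine_similarity(human[k], model[k]) for k in human) / len(human)

print(f"human variation: {human_variation:.3f}")
print(f"model variation: {model_variation:.3f}")
print(f"mean model-human similarity: {alignment:.3f}")
```

In this toy setup the model vectors cluster tightly while the human vectors are mutually orthogonal, so the model's cross-language variation comes out far lower than the human one. That mirrors the kind of gap the abstract reports, but the numbers themselves carry no empirical meaning.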
