CLOct 15, 2025

CRaFT: An Explanation-Based Framework for Evaluating Cultural Reasoning in Multilingual Language Models

arXiv:2510.14014v11 citationsh-index: 13
Originality Incremental advance
AI Analysis

This work addresses the need for better evaluation of cultural understanding in AI systems, particularly for building culturally adaptive language models, though it is incremental as it builds on existing multilingual evaluation methods.

The authors tackled the problem of evaluating cultural reasoning in multilingual language models by introducing CRaFT, an explanation-based framework that assesses models using metrics like Cultural Fluency and Consistency, applied to 50 questions across three languages, revealing significant cross-lingual variations such as Arabic reducing fluency and Bengali enhancing it.

Correct answers do not necessarily reflect cultural understanding. We introduce CRaFT, an explanation-based multilingual evaluation framework designed to assess how large language models (LLMs) reason across cultural contexts. Rather than scoring outputs solely based on accuracy, CRaFT evaluates model explanations using four interpretable metrics: Cultural Fluency, Deviation, Consistency, and Linguistic Adaptation. We apply the framework to 50 culturally grounded questions from the World Values Survey, translated into Arabic, Bengali, and Spanish, and evaluate three models (GPT, DeepSeek, and FANAR) across over 2,100 answer-explanation pairs. Results reveal significant cross-lingual variation in reasoning: Arabic reduces fluency, Bengali enhances it, and Spanish remains largely stable. While GPT adapts more effectively across languages, it exhibits lower consistency; FANAR shows stable but rigid reasoning. These findings suggest that cultural awareness in LLMs is not intrinsic but emerges through linguistic framing. CRaFT offers a new lens for evaluating cross-cultural reasoning in multilingual settings, providing actionable insights for building culturally adaptive language models.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes