No Shortcuts to Culture: Indonesian Multi-hop Question Answering for Complex Cultural Understanding
This work addresses the need for better benchmarks of cultural competency in AI systems. The contribution is incremental: it builds on existing QA frameworks by adding multi-hop complexity.
The authors address the problem of assessing cultural understanding in large language models by introducing ID-MoCQA, a multi-hop question answering dataset focused on Indonesian traditions. They find that state-of-the-art models show substantial gaps in cultural reasoning, particularly on tasks requiring nuanced inference.
Understanding culture requires reasoning across context, tradition, and implicit social knowledge, far beyond recalling isolated facts. Yet most culturally focused question answering (QA) benchmarks rely on single-hop questions, which may allow models to exploit shallow cues rather than demonstrate genuine cultural reasoning. In this work, we introduce ID-MoCQA, the first large-scale multi-hop QA dataset for assessing the cultural understanding of large language models (LLMs), grounded in Indonesian traditions and available in both English and Indonesian. We present a new framework that systematically transforms single-hop cultural questions into multi-hop reasoning chains spanning six clue types (e.g., commonsense, temporal, geographical). Our multi-stage validation pipeline, combining expert review and LLM-as-a-judge filtering, ensures high-quality question-answer pairs. Our evaluation across state-of-the-art models reveals substantial gaps in cultural reasoning, particularly in tasks requiring nuanced inference. ID-MoCQA provides a challenging and essential benchmark for advancing the cultural competency of LLMs.
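To make the idea of turning a single-hop question into a multi-hop one more concrete, the following is a minimal sketch and not the authors' actual framework. It assumes that one hop can be added by replacing the explicitly named cultural entity with an indirect clue of a given type. The names `ClueType`, `Clue`, `SingleHopQuestion`, and `to_multi_hop` are hypothetical, the example question is illustrative and not drawn from ID-MoCQA, and only the three clue types named in the abstract are modelled.

```python
from dataclasses import dataclass
from enum import Enum


class ClueType(Enum):
    # Only the three clue types named in the abstract are listed here;
    # the remaining three are not specified there, so they are omitted.
    COMMONSENSE = "commonsense"
    TEMPORAL = "temporal"
    GEOGRAPHICAL = "geographical"


@dataclass(frozen=True)
class Clue:
    clue_type: ClueType
    description: str  # indirect description that stands in for the entity


@dataclass(frozen=True)
class SingleHopQuestion:
    text: str
    entity: str  # cultural entity named explicitly in the question
    answer: str


def to_multi_hop(question: SingleHopQuestion, clue: Clue) -> str:
    """Hypothetical transformation: hide the entity behind a clue.

    The reader must first resolve the clue to the entity (hop 1)
    and only then answer the original question (hop 2).
    """
    if question.entity not in question.text:
        raise ValueError("entity must appear verbatim in the question text")
    return question.text.replace(question.entity, clue.description, 1)


# Illustrative example, not taken from the dataset.
single = SingleHopQuestion(
    text="Which musical ensemble traditionally accompanies wayang kulit?",
    entity="wayang kulit",
    answer="gamelan",
)
clue = Clue(ClueType.GEOGRAPHICAL, "the shadow puppet theatre of Java and Bali")
multi_hop_text = to_multi_hop(single, clue)
print(multi_hop_text)
```

In this sketch the answer is unchanged, but the question no longer names the entity directly, so a surface match on "wayang kulit" is not enough to answer it. A real pipeline would also need the validation stages the abstract describes (expert review and LLM-as-a-judge filtering), which are not modelled here.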