CL · AI · LG · Jun 25, 2024

CaLMQA: Exploring culturally specific long-form question answering across 23 languages

Shane Arora, Marzena Karpinska, Hung-Ting Chen, Ipsita Bhattacharjee, Mohit Iyyer, Eunsol Choi

arXiv:2406.17761v314.136 citations · Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses a gap in evaluating LLMs on culturally specific QA across diverse languages, which matters for making AI systems accurate and accessible worldwide. The contribution is incremental in that it centers on dataset creation and benchmarking.

The paper evaluates large language models (LLMs) on culturally specific long-form question answering across 23 languages. It introduces the CaLMQA dataset of 51.7K questions and finds that models make critical surface-level errors, especially for low-resource languages, and that answers to culturally specific questions contain more factual errors than answers to culturally agnostic ones.

Despite rising global usage of large language models (LLMs), their ability to generate long-form answers to culturally specific questions remains unexplored in many languages. To fill this gap, we perform the first study of textual multilingual long-form QA by creating CaLMQA, a dataset of 51.7K culturally specific questions across 23 different languages. We define culturally specific questions as those that refer to concepts unique to one or a few cultures, or have different answers depending on the cultural or regional context. We obtain these questions by crawling naturally-occurring questions from community web forums in high-resource languages, and by hiring native speakers to write questions in under-resourced, rarely-studied languages such as Fijian and Kirundi. Our data collection methodologies are translation-free, enabling the collection of culturally unique questions like "Kuber iki umwami wa mbere w'uburundi yitwa Ntare?" (Kirundi; English translation: "Why was the first king of Burundi called Ntare (Lion)?"). We evaluate factuality, relevance and surface-level quality of LLM-generated long-form answers, finding that (1) for many languages, even the best models make critical surface-level errors (e.g., answering in the wrong language, repetition), especially for low-resource languages; and (2) answers to culturally specific questions contain more factual errors than answers to culturally agnostic questions -- questions that have consistent meaning and answer across many cultures. We release CaLMQA to facilitate future research in cultural and multilingual long-form QA.
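One of the surface-level errors named in the abstract is repetition. As a rough illustration only, the sketch below scores how much of an answer consists of repeated n-grams. It is not the authors' evaluation method: the function name `repeated_ngram_fraction`, the default `n = 3`, and the whitespace tokenization are all assumptions made for this example. Whitespace splitting does not suit languages written without spaces, so a character-level mode is included as a fallback.

```python
from collections import Counter


def repeated_ngram_fraction(text: str, n: int = 3, unit: str = "word") -> float:
    """Return the fraction of n-grams that repeat an earlier n-gram.

    Hypothetical helper for illustration; not taken from the CaLMQA paper.
    unit="word" splits on whitespace, unit="char" uses single characters.
    """
    if unit == "word":
        tokens = text.split()
    elif unit == "char":
        tokens = list(text)
    else:
        raise ValueError("unit must be 'word' or 'char'")

    # Build all contiguous n-grams over the token sequence.
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0

    # Every occurrence of an n-gram beyond its first counts as a repeat.
    counts = Counter(ngrams)
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(ngrams)


if __name__ == "__main__":
    looping = "the king was called Ntare the king was called Ntare the king"
    normal = "the first king of Burundi was called Ntare"
    print(repeated_ngram_fraction(looping))
    print(repeated_ngram_fraction(normal))
```

A score near 0 means little repetition, and a score approaching 1 means the answer is mostly looping text. Any threshold for flagging an answer would need to be tuned per language and is left open here.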
