CL · AI · LG · Jun 25, 2024

CaLMQA: Exploring culturally specific long-form question answering across 23 languages

arXiv:2406.17761v3 · 36 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses a gap in evaluating LLMs on culturally specific QA across diverse languages, which matters for making AI systems accessible and accurate worldwide. The contribution is incremental in that it centers on dataset creation and benchmarking.

The paper evaluates large language models (LLMs) on culturally specific long-form question answering across 23 languages using CaLMQA, a new dataset of 51.7K questions. It finds that models make critical surface-level errors, especially in low-resource languages, and that answers to culturally specific questions contain more factual errors than answers to culturally agnostic ones.

Despite rising global usage of large language models (LLMs), their ability to generate long-form answers to culturally specific questions remains unexplored in many languages. To fill this gap, we perform the first study of textual multilingual long-form QA by creating CaLMQA, a dataset of 51.7K culturally specific questions across 23 different languages. We define culturally specific questions as those that refer to concepts unique to one or a few cultures, or have different answers depending on the cultural or regional context. We obtain these questions by crawling naturally-occurring questions from community web forums in high-resource languages, and by hiring native speakers to write questions in under-resourced, rarely-studied languages such as Fijian and Kirundi. Our data collection methodologies are translation-free, enabling the collection of culturally unique questions like "Kuber iki umwami wa mbere w'uburundi yitwa Ntare?" (Kirundi; English translation: "Why was the first king of Burundi called Ntare (Lion)?"). We evaluate factuality, relevance and surface-level quality of LLM-generated long-form answers, finding that (1) for many languages, even the best models make critical surface-level errors (e.g., answering in the wrong language, repetition), especially for low-resource languages; and (2) answers to culturally specific questions contain more factual errors than answers to culturally agnostic questions -- questions that have consistent meaning and answer across many cultures. We release CaLMQA to facilitate future research in cultural and multilingual long-form QA.

Code Implementations: 1 repo
