CL · Mar 16

Information Asymmetry across Language Varieties: A Case Study on Cantonese-Mandarin and Bavarian-German QA

arXiv:2603.14782 · 83.1 · h-index: 9
Predicted impact: top 59% in CL · last 90 days
Originality: Incremental advance
AI Analysis

The paper addresses gaps in inclusivity and cultural coverage that LLMs show for users of local language varieties, but the advance is incremental: it builds on existing QA and multilingual research.

The study tackles information asymmetry in LLMs for local language varieties by constructing a QA dataset for Cantonese-Mandarin and Bavarian-German. It finds that LLMs fail to answer questions whose answers appear only in the local Wikipedia edition, and that performance improves when context is provided and improves further with translation.

Large Language Models (LLMs) are becoming a common way for humans to seek knowledge, yet their coverage and reliability vary widely. The asymmetries are especially large for local language varieties: for example, a local Wikipedia edition may contain information that is absent from the standard variant. However, little is known about how well LLMs perform under such information asymmetry, especially on closely related languages. We manually construct a novel challenge question-answering (QA) dataset that captures knowledge conveyed on a local Wikipedia page but absent from its higher-resource counterpart, covering Mandarin Chinese vs. Cantonese and German vs. Bavarian. Our experiments show that LLMs fail to answer questions about information found only in local editions of Wikipedia. Providing context from lead sections substantially improves performance, with further gains possible via translation. Our topical and geographic annotations, together with stratified evaluations, reveal the usefulness of local Wikipedia editions as sources of both regional and global information. These findings raise critical questions about the inclusivity and cultural coverage of LLMs.

