CL · Sep 4, 2025

MultiWikiQA: A Reading Comprehension Benchmark in 300+ Languages

arXiv:2509.04111v2 · 7 citations · h-index: 1

Originality: Synthesis-oriented

AI Analysis

The paper provides a new benchmark for multilingual reading comprehension, but the contribution is incremental: it extends existing datasets to more languages.

The authors introduce MultiWikiQA, a reading comprehension dataset covering 306 languages, built from Wikipedia articles and LLM-generated questions. Evaluating 6 language models, they find the benchmark challenging, with large performance gaps across languages.

We introduce a new reading comprehension dataset, dubbed MultiWikiQA, which covers 306 languages. The context data comes from Wikipedia articles, with questions generated by an LLM and the answers appearing verbatim in the Wikipedia articles. We conduct a crowdsourced human evaluation of the fluency of the generated questions across 30 of the languages, providing evidence that the questions are of good quality. We evaluate 6 different language models, both decoder and encoder models of varying sizes, showing that the benchmark is sufficiently difficult and that there is a large performance discrepancy amongst the languages. The dataset and survey evaluations are freely available.
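Because the answers appear verbatim in the Wikipedia articles, each example can be treated as an extractive span over its context. The following is a minimal sketch of that property, not the authors' code: the record layout (`context`, `question`, `answer`), the helper `find_answer_span`, and the sample text are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class QAExample:
    """Hypothetical record layout; the field names are assumptions."""
    context: str
    question: str
    answer: str


def find_answer_span(example: QAExample) -> Optional[Tuple[int, int]]:
    """Return (start, end) character offsets of the answer in the context.

    Returns None when the answer is empty or does not occur verbatim.
    """
    if not example.answer:
        return None
    start = example.context.find(example.answer)
    if start == -1:
        return None
    return start, start + len(example.answer)


# Made-up sample, not taken from the dataset.
example = QAExample(
    context="Copenhagen is the capital of Denmark.",
    question="What is the capital of Denmark?",
    answer="Copenhagen",
)
span = find_answer_span(example)
```

A check like this could be run over a dataset split to confirm that every answer is recoverable as a character span before scoring model predictions against it.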
