DL CL | Aug 5, 2025

MegaWika 2: A More Comprehensive Multilingual Collection of Articles and their Sources

Samuel Barham, Chandler May, Benjamin Van Durme

arXiv:2508.03828v1 | 3 citations | h-index: 15

Originality: Synthesis-oriented

AI Analysis

This dataset addresses the scarcity of resources for multilingual fact checking and temporal analysis, though as an upgrade to an existing dataset its contribution is incremental.

The authors address the need for a more extensive multilingual dataset by introducing MegaWika 2, which covers six times as many articles and twice as many fully scraped citations as the original MegaWika and supports fact checking and analyses across time and language.

We introduce MegaWika 2, a large, multilingual dataset of Wikipedia articles with their citations and scraped web sources; articles are represented in a rich data structure, and scraped source texts are stored inline with precise character offsets of their citations in the article text. MegaWika 2 is a major upgrade from the original MegaWika, spanning six times as many articles and twice as many fully scraped citations. Both MegaWika and MegaWika 2 support report generation research; whereas MegaWika also focused on supporting question answering and retrieval applications, MegaWika 2 is designed to support fact checking and analyses across time and language.
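
To make the idea of inline sources with character offsets concrete, here is a minimal Python sketch. It is not the actual MegaWika 2 schema: the class names (`Article`, `Citation`), every field name, and the helper functions are hypothetical, chosen only to illustrate how a citation offset into the article text can be paired with scraped source text for a task such as fact checking.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Citation:
    # Hypothetical field: character offset in the article text where the
    # citation marker sits (i.e. just after the text it supports).
    offset: int
    source_url: str
    # Scraped source text stored inline; None if the source was not scraped.
    source_text: Optional[str] = None


@dataclass
class Article:
    title: str
    language: str
    text: str
    citations: List[Citation] = field(default_factory=list)


def context_before(article: Article, citation: Citation, width: int = 200) -> str:
    """Return up to `width` characters of article text ending at the citation offset."""
    start = max(0, citation.offset - width)
    return article.text[start:citation.offset]


def claim_source_pairs(article: Article) -> List[Tuple[str, str]]:
    """Pair the text preceding each citation with its scraped source, skipping unscraped ones."""
    return [
        (context_before(article, c), c.source_text)
        for c in article.citations
        if c.source_text is not None
    ]


# Toy example with made-up content and placeholder URLs.
claim = "Paris is the capital of France."
text = claim + " It hosts the Louvre."
article = Article(
    title="Paris",
    language="en",
    text=text,
    citations=[
        Citation(
            offset=len(claim),
            source_url="https://example.org/paris",
            source_text="Paris is the capital and most populous city of France.",
        ),
        # A citation whose source was not scraped is kept but yields no pair.
        Citation(offset=len(text), source_url="https://example.org/louvre"),
    ],
)

pairs = claim_source_pairs(article)
```

Storing the offset alongside the inline source text means a claim and its supporting evidence can be recovered without re-parsing the article markup, which is the kind of access pattern fact-checking experiments would need.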
