CLJul 13, 2023

MegaWika: Millions of reports and their sources across 50 diverse languages

arXiv:2307.07049v111 citationsh-index: 60
Originality Synthesis-oriented
AI Analysis

This provides a large-scale multilingual resource for researchers working on collaborative AI report generation, though it is incremental as it builds on existing Wikipedia data.

The authors introduced MegaWika, a dataset of 13 million Wikipedia articles in 50 languages with 71 million referenced sources, to support AI-assisted report generation. They processed it for applications like cross-lingual translation and semantic analysis, and provided baseline models for tasks such as cross-lingual question answering.

To foster the development of new models for collaborative AI-assisted report generation, we introduce MegaWika, consisting of 13 million Wikipedia articles in 50 diverse languages, along with their 71 million referenced source materials. We process this dataset for a myriad of applications, going beyond the initial Wikipedia citation extraction and web scraping of content, including translating non-English articles for cross-lingual applications and providing FrameNet parses for automated semantic analysis. MegaWika is the largest resource for sentence-level report generation and the only report generation dataset that is multilingual. We manually analyze the quality of this resource through a semantically stratified sample. Finally, we provide baseline results and trained models for crucial steps in automated report generation: cross-lingual question answering and citation retrieval.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes