CL | Oct 7, 2020

WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization

arXiv:2010.03093v1 | 1027 citations
Originality: Incremental advance
AI Analysis

WikiLingua provides a new benchmark for evaluating cross-lingual summarization systems, addressing a domain-specific need in natural language processing.

The authors introduce WikiLingua, a large-scale multilingual dataset for cross-lingual abstractive summarization, and propose a direct cross-lingual summarization method that significantly outperforms the baselines while being more cost-efficient at inference time.

We introduce WikiLingua, a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems. We extract article and summary pairs in 18 languages from WikiHow, a high-quality, collaborative resource of how-to guides on a diverse set of topics written by human authors. We create gold-standard article-summary alignments across languages by aligning the images that are used to describe each how-to step in an article. As a set of baselines for further studies, we evaluate the performance of existing cross-lingual abstractive summarization methods on our dataset. We further propose a method for direct cross-lingual summarization (i.e., without requiring translation at inference time) by leveraging synthetic data and Neural Machine Translation as a pre-training step. Our method significantly outperforms the baseline approaches, while being more cost-efficient during inference.
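
The image-based alignment described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline. It assumes that each how-to step carries a one-line summary, a detailed paragraph, and an image identifier that is shared across language versions, and that an article and its summary are formed by concatenating step texts and step summaries. The `Step` class, its field names, and both helper functions are hypothetical names introduced for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One how-to step (hypothetical structure, assumed for this sketch)."""
    image_id: str  # assumed: the same illustration is reused across languages
    summary: str   # short one-line description of the step
    text: str      # detailed paragraph describing the step


def align_steps(source_steps, target_steps):
    """Pair steps from two language versions that share the same image."""
    target_by_image = {step.image_id: step for step in target_steps}
    return [
        (src, target_by_image[src.image_id])
        for src in source_steps
        if src.image_id in target_by_image
    ]


def cross_lingual_pair(aligned):
    """Build a (source-language article, target-language summary) pair."""
    article = " ".join(src.text for src, _ in aligned)
    summary = " ".join(tgt.summary for _, tgt in aligned)
    return article, summary


# Toy data: a Spanish guide and its English counterpart.
es_steps = [
    Step("img_01.jpg", "Hierve el agua.", "Llena una olla con agua y ponla a hervir."),
    Step("img_02.jpg", "Agrega la pasta.", "Echa la pasta en el agua hirviendo."),
    Step("img_99.jpg", "Sirve.", "Sirve la pasta caliente."),  # no English match
]
en_steps = [
    Step("img_01.jpg", "Boil the water.", "Fill a pot with water and bring it to a boil."),
    Step("img_02.jpg", "Add the pasta.", "Drop the pasta into the boiling water."),
]

aligned = align_steps(es_steps, en_steps)
article, summary = cross_lingual_pair(aligned)
print(article)
print(summary)
```

In this sketch, steps without a matching image in the other language are dropped, so the resulting pair covers only content present in both versions. Whether the actual dataset construction handles unmatched steps this way is not stated in the abstract.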

Code Implementations: 1 repo