CL LG Oct 24, 2022

LANS: Large-scale Arabic News Summarization Corpus

Abdulaziz Alhamadani, Xuchao Zhang, Jianfeng He, Chang-Tien Lu

arXiv:2210.13600v1

Originality: Synthesis-oriented

AI Analysis

This work addresses the shortage of resources available to Arabic summarization researchers, though the contribution is incremental: it focuses on dataset creation rather than novel methods.

The authors tackle the lack of large, diverse datasets for Arabic Text Summarization by building LANS, a corpus of 8.4 million articles and their summaries published between 1999 and 2019. Human evaluation of 1000 random samples reports 95.4% accuracy for the collected summaries.

Text summarization has been intensively studied in many languages, and some languages have reached advanced stages. Yet Arabic Text Summarization (ATS) is still in its developing stages. Existing ATS datasets are either small or lack diversity. We build LANS, a large-scale and diverse dataset for the Arabic Text Summarization task. LANS offers 8.4 million articles and their summaries, extracted from the metadata of newspaper websites between 1999 and 2019. The high-quality and diverse summaries are written by journalists from 22 major Arab newspapers and include an eclectic mix of at least 7 topics from each source. We conduct an intrinsic evaluation of LANS using both automatic and human evaluations. Human evaluation of 1000 random samples reports 95.4% accuracy for our collected summaries, and automatic evaluation quantifies the diversity and abstractness of the summaries. The dataset is publicly available upon request.
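
The abstract does not say which automatic metrics quantify abstractness. One common proxy in summarization work is the share of summary n-grams that never appear in the article, and the sketch below illustrates that idea only. The function names, the whitespace tokenization, and the English toy strings are all assumptions made for illustration; they are not taken from the LANS paper, and real Arabic text would need proper normalization and tokenization first.

```python
def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) found in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def novel_ngram_ratio(article, summary, n=1):
    """Fraction of distinct summary n-grams that do not occur in the article.

    Hypothetical helper: higher values suggest a more abstractive summary.
    Tokenization is naive whitespace splitting (an assumption).
    """
    article_ngrams = ngrams(article.split(), n)
    summary_ngrams = ngrams(summary.split(), n)
    if not summary_ngrams:
        return 0.0  # summary shorter than n tokens
    return len(summary_ngrams - article_ngrams) / len(summary_ngrams)


# Toy example (placeholder English text, not LANS data)
article = "the cat sat on the mat"
summary = "a cat sat quietly"
unigram_novelty = novel_ngram_ratio(article, summary, n=1)
bigram_novelty = novel_ngram_ratio(article, summary, n=2)
```

A ratio near 0 would indicate a summary largely copied from the article, while a ratio near 1 would indicate mostly new wording. Averaging the ratio over a corpus gives one rough, assumption-laden view of how abstractive its summaries are.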
