CL Oct 13, 2024

A Mixed-Language Multi-Document News Summarization Dataset and a Graphs-Based Extract-Generate Model

Shengxiang Gao, Fang nan, Yongbing Zhang, Yuxin Huang, Kaiwen Tan, Zhengtao Yu

arXiv:2410.09773v19.112 citations | h-index: 14 | Has Code | NAACL

Originality: Incremental advance

AI Analysis

This paper addresses the problem of summarizing international news reported across multiple languages, which is relevant to both researchers and practitioners. The advance is incremental, as it builds on existing summarization frameworks.

The authors address the lack of datasets for mixed-language multi-document news summarization by constructing MLMD-news, which contains 10,992 source-target pairs in four languages. They also propose a graph-based extract-generate model and benchmark various methods on the dataset.

Existing research on news summarization primarily focuses on single-language single-document (SLSD), single-language multi-document (SLMD), or cross-language single-document (CLSD) settings. In real-world scenarios, however, news about an international event often spans multiple documents in different languages, i.e., the mixed-language multi-document (MLMD) setting. Summarizing MLMD news is therefore of great significance, yet the lack of datasets for this task has constrained research in the area. To fill this gap, we construct a mixed-language multi-document news summarization dataset (MLMD-news), which covers four languages and contains 10,992 pairs of source document clusters and target summaries. Additionally, we propose a graph-based extract-generate model, benchmark various methods on the MLMD-news dataset, and publicly release our dataset and code (https://github.com/Southnf9/MLMD-news), aiming to advance research on summarization in MLMD scenarios.
