Leveraging Digitized Newspapers to Collect Summarization Data in Low-Resource Languages
The work addresses data scarcity for summarization in under-represented languages, though its contribution is incremental: it applies existing digitization resources to a new task.
The paper tackles the scarcity of summarization data in low-resource languages by automatically collecting naturally occurring summaries from digitized historical newspapers, using front-page teasers as the summaries. Applying the method to a Hebrew newspaper yields HEBTEASESUM, the first dedicated multi-document summarization dataset in Hebrew.
High-quality summarization data remains scarce in under-represented languages. However, historical newspapers, made available through recent digitization efforts, offer an abundant source of untapped, naturally annotated data. In this work, we present a novel method for collecting naturally occurring summaries via Front-Page Teasers, where editors summarize full-length articles. We show that this phenomenon is common across seven diverse languages and supports multi-document summarization. To scale data collection, we develop an automatic process suited to varying linguistic resource levels. Finally, we apply this process to a Hebrew newspaper title, producing HEBTEASESUM, the first dedicated multi-document summarization dataset in Hebrew.