IR CL Apr 9, 2013

Corpus-based Web Document Summarization using Statistical and Linguistic Approach

Rushdi Shams, M. M. A. Hashem, Afrina Hossain, Suraiya Rumana Akter, Monika Gope

arXiv:1304.2476v1 8 citations

Originality: Incremental advance

AI Analysis

This paper addresses summarization of domain-specific web documents, offering an incremental improvement by integrating corpus-based statistical and linguistic features.

The paper tackles single-document summarization for domain-specific web text by combining statistical and linguistic analysis with a reference corpus, using a novel ranking function based on sentence and subject weights. Results show that 68% of the generated summaries align with manual summaries from human evaluators.

Single-document summarization generates a summary by extracting representative sentences from the document. In this paper, we present a novel technique for summarizing domain-specific text from a single web document that applies statistical and linguistic analysis to the text of both a reference corpus and the web document. The proposed summarizer ranks each sentence using a combined function of Sentence Weight (SW) and Subject Weight (SuW). SW is a function of the number of terms (t_n) and the number of words (w_n) in a sentence and of the term frequency (t_f) in the corpus; SuW is a function of t_n and w_n in a subject and of t_f in the corpus. The top 30 percent of the ranked sentences are taken as the summary of the web document. We generated summaries of three web documents using our technique and compared each of them with summaries written manually by 16 different human subjects. Results showed that 68 percent of the summaries produced by our approach agree with the manual summaries.
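
As a rough illustration of the rank-and-cut step described above, the following Python sketch scores sentences against corpus term frequencies and keeps the top 30 percent. The abstract does not give the actual SW or SuW formulas, so `sentence_weight` is a hypothetical stand-in (summed corpus term frequency divided by word count) and SuW is omitted entirely. The names `tokenize`, `summarize`, and `corpus_tf`, and the choice to round the cut up, are likewise assumptions made for illustration, not the authors' implementation.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a deliberately simple stand-in for real preprocessing."""
    return re.findall(r"[a-z0-9]+", text.lower())


def sentence_weight(sentence, corpus_tf):
    """Hypothetical score (NOT the paper's SW formula): summed corpus
    frequency of the sentence's words, normalized by its word count."""
    words = tokenize(sentence)
    if not words:
        return 0.0
    return sum(corpus_tf.get(w, 0) for w in words) / len(words)


def summarize(sentences, corpus_tf, percent=30):
    """Rank sentences by score and keep the top `percent`, in document order."""
    if not sentences:
        return []
    # Integer ceiling of percent% of the sentence count (assumed rounding rule).
    keep = max(1, (percent * len(sentences) + 99) // 100)
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: sentence_weight(sentences[i], corpus_tf),
        reverse=True,
    )
    # Restore original order so the extract reads naturally.
    return [sentences[i] for i in sorted(ranked[:keep])]


if __name__ == "__main__":
    # Toy reference corpus and document, invented for this example.
    corpus_tf = Counter(tokenize("network router packet network router network"))
    document = [
        "The network router forwards each packet.",
        "I like tea.",
        "Weather is nice today.",
        "Network network network.",
    ]
    for line in summarize(document, corpus_tf):
        print(line)
```

Returning the selected sentences in document order rather than rank order is a design choice of this sketch; the abstract does not say how the extracted sentences are arranged.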
