IR CL, Sep 14, 2012

Beyond Stemming and Lemmatization: Ultra-stemming to Improve Automatic Text Summarization

arXiv:1209.3126v1, 24 citations

Originality: Incremental advance

AI Analysis

This paper addresses the curse of dimensionality in text summarization preprocessing, offering a domain-specific improvement for summarization systems.

The paper tackles the problem of high dimensionality in automatic text summarization by proposing Ultra-stemming, a method that reduces words to their initial letters, and shows that it improves system performance on trilingual corpora, with results confirmed by automatic evaluation using Fresa.

In Automatic Text Summarization, preprocessing is an important phase that reduces the space of textual representation. Classically, stemming and lemmatization have been widely used for normalizing words. However, even with normalization, the curse of dimensionality can degrade the performance of summarizers on large texts. This paper describes a new method for normalizing words that further reduces the space of representation. We propose to reduce each word to its initial letters, as a form of Ultra-stemming. The results show that Ultra-stemming not only preserves the content of summaries produced with this representation, but can often dramatically improve the performance of the systems. Summaries on trilingual corpora were evaluated automatically with Fresa. Results confirm an increase in performance, regardless of the summarizer system used.
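The core idea of the abstract, reducing each word to its initial letters, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`ultra_stem`, `tokenize`, `ultra_stem_counts`), the prefix length parameter `n`, its default of 1, the lowercasing step, and the regex-based tokenizer are all assumptions made for illustration.

```python
import re
from collections import Counter


def ultra_stem(word, n=1):
    """Truncate a word to its first n letters (hypothetical helper).

    The abstract only says words are reduced to their "initial letters";
    the prefix length n and the lowercasing are assumptions.
    """
    return word.lower()[:n]


def tokenize(text):
    # Assumption: a word is a run of Unicode letters (digits and
    # punctuation are dropped). A real summarizer may tokenize differently.
    return re.findall(r"[^\W\d_]+", text)


def ultra_stem_counts(text, n=1):
    """Count occurrences of each ultra-stem in a text."""
    return Counter(ultra_stem(word, n) for word in tokenize(text))


if __name__ == "__main__":
    sample = "Stemming and lemmatization normalize words in summaries."
    tokens = tokenize(sample)
    # Compare vocabulary size before and after ultra-stemming.
    print(len({t.lower() for t in tokens}), "distinct words")
    print(len(ultra_stem_counts(sample, n=1)), "distinct ultra-stems (n=1)")
```

With `n=1` the vocabulary can never exceed the size of the alphabet, which is the kind of dimensionality reduction the abstract refers to. Larger values of `n` keep more of each word at the cost of a larger representation space.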
