IR | CL | May 25, 2016

SS4MCT: A Statistical Stemmer for Morphologically Complex Texts

arXiv:1605.07852v2 | 5 citations
Originality: Incremental advance
AI Analysis

This paper addresses stemming for information retrieval in morphologically complex languages, but the contribution is incremental because it builds on existing statistical techniques.

The paper tackles stemming in morphologically complex texts. It proposes a method that derives statistical inflectional rules from the minimum edit distance between word pairs and from the likelihoods of those rules, and it significantly outperforms the baselines in mean average precision (MAP) on the CLEF 2008 and CLEF 2009 English-Persian CLIR tasks.

There have been multiple attempts to resolve various inflection matching problems in information retrieval, and stemming is a common approach to this end. Among the many stemming techniques, statistical stemming has been shown to be effective in a number of languages, particularly highly inflected ones. In this paper we propose a method for finding affixes in different positions of a word. Common statistical techniques rely heavily on string similarity in terms of prefix and suffix matching. Since infixes are common in irregular and informal inflections in morphologically complex texts, a stemmer for such texts must also be able to find infixes. The proposed method finds statistical inflectional rules based on the minimum edit distance table of word pairs and the likelihoods of the rules in a language. These rules are used to stem words statistically and can be applied in different text mining tasks. Experimental results on the CLEF 2008 and CLEF 2009 English-Persian CLIR tasks indicate that the proposed method significantly outperforms all the baselines in terms of MAP.
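To make the idea of rules derived from an edit distance table more concrete, the following is a minimal sketch and not the paper's actual algorithm. It builds a standard Levenshtein table for a word pair, backtraces it into an alignment, and groups the edits into prefix, infix, or suffix segments. The function names (`edit_table`, `align`, `extract_rule`, `rule_likelihoods`), the rule representation, the relative-frequency estimate of rule likelihood, and the English toy word pairs are all assumptions made for illustration.

```python
from collections import Counter


def edit_table(src, dst):
    """Standard Levenshtein dynamic-programming table (unit costs)."""
    rows, cols = len(src) + 1, len(dst) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        table[i][0] = i
    for j in range(cols):
        table[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if src[i - 1] == dst[j - 1] else 1
            table[i][j] = min(table[i - 1][j] + 1,        # deletion
                              table[i][j - 1] + 1,        # insertion
                              table[i - 1][j - 1] + cost)  # match or substitution
    return table


def align(src, dst):
    """Backtrace the table into a list of (src_char, dst_char) pairs."""
    table = edit_table(src, dst)
    i, j = len(src), len(dst)
    pairs = []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and src[i - 1] == dst[j - 1] \
                and table[i][j] == table[i - 1][j - 1]:
            pairs.append((src[i - 1], dst[j - 1]))  # match
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and table[i][j] == table[i - 1][j - 1] + 1:
            pairs.append((src[i - 1], dst[j - 1]))  # substitution
            i, j = i - 1, j - 1
        elif i > 0 and table[i][j] == table[i - 1][j] + 1:
            pairs.append((src[i - 1], ""))          # deletion from src
            i -= 1
        else:
            pairs.append(("", dst[j - 1]))          # insertion into src
            j -= 1
    pairs.reverse()
    return pairs


def extract_rule(inflected, stem):
    """Group contiguous edits into (position, removed, added) segments."""
    pairs = align(inflected, stem)
    last = len(pairs) - 1
    segments, k = [], 0
    while k <= last:
        if pairs[k][0] == pairs[k][1]:
            k += 1
            continue
        start = k
        while k <= last and pairs[k][0] != pairs[k][1]:
            k += 1
        removed = "".join(a for a, _ in pairs[start:k])
        added = "".join(b for _, b in pairs[start:k])
        if start == 0:
            position = "prefix"
        elif k - 1 == last:
            position = "suffix"
        else:
            position = "infix"
        segments.append((position, removed, added))
    return tuple(segments)


def rule_likelihoods(word_pairs):
    """Assumed estimate: relative frequency of each rule over the pairs."""
    counts = Counter(extract_rule(w, s) for w, s in word_pairs)
    total = sum(counts.values())
    return {rule: n / total for rule, n in counts.items()}


# Toy (inflected, stem) pairs, chosen only for illustration.
toy_pairs = [("walking", "walk"), ("talking", "talk"), ("sang", "sing")]
likelihoods = rule_likelihoods(toy_pairs)
```

Because the alignment covers every position in the word, an edit in the middle of a word such as the vowel change in "sang" is captured as an infix segment, which pure prefix or suffix matching would miss. A real system would need a way to select candidate word pairs and a more careful likelihood model than the raw relative frequency used here.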

