IR · Apr 5, 2017

Part of Speech Based Term Weighting for Information Retrieval

arXiv:1704.01617v18.243 citations

Originality: Incremental advance

AI Analysis

This work enhances term weighting for information retrieval with linguistic (part-of-speech) features. It is an incremental advance, since it extends existing statistical approximations of term information measures.

The authors aim to improve information retrieval by computing term weights from part-of-speech (POS) n-gram statistics. These weights measure how informative a term is, based on the POS contexts in which it occurs. In experiments on two TREC collections with 300 queries, integrating the weights into retrieval yielded gains of up to +33.7% over the TF-IDF and BM25 baselines.

Automatic language processing tools typically assign terms so-called weights, which correspond to the contribution of each term to information content. Traditionally, term weights are computed from lexical statistics, e.g., term frequencies. We propose a new type of term weight that is computed from part of speech (POS) n-gram statistics. The proposed POS-based term weight represents how informative a term is in general, based on the POS contexts in which it generally occurs in language. We suggest five different computations of POS-based term weights by extending existing statistical approximations of term information measures. We apply these POS-based term weights to information retrieval by integrating them into the model that matches documents to queries. Experiments with two TREC collections and 300 queries, using TF-IDF and BM25 as baselines, show that integrating our POS-based term weights into retrieval always leads to gains (up to +33.7% over the baseline). Additional experiments with a different retrieval model as baseline (Language Model with Dirichlet priors smoothing) and our best performing POS-based term weight show consistent retrieval gains across the whole smoothing range of the baseline.
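To make the idea concrete, here is a minimal Python sketch of a POS-context term weight folded into a TF-IDF style score. It is not one of the paper's five computations, which the abstract does not spell out. The idf-style informativeness score for POS n-grams, the averaging step, the multiplicative combination, and the names `pos_ngrams`, `pos_term_weights`, and `weighted_tf_idf` are all assumptions made for illustration. The input is assumed to be already POS-tagged; the tags in the toy data are illustrative.

```python
import math
from collections import Counter, defaultdict


def pos_ngrams(tagged_doc, n):
    """Yield (start, POS n-gram) for every window of n tokens in one document."""
    tags = [tag for _, tag in tagged_doc]
    for start in range(len(tags) - n + 1):
        yield start, tuple(tags[start:start + n])


def pos_term_weights(tagged_docs, n=2):
    """Average an idf-style score over the POS n-grams that contain each term."""
    # Count in how many documents each POS n-gram appears.
    doc_freq = Counter()
    for doc in tagged_docs:
        doc_freq.update({gram for _, gram in pos_ngrams(doc, n)})

    # ASSUMPTION: a rarer POS n-gram is treated as more informative, using a
    # smoothed idf-like score. The paper's actual measures may differ.
    num_docs = len(tagged_docs)
    info = {g: math.log((num_docs + 1) / (df + 0.5)) for g, df in doc_freq.items()}

    # A term's weight is the mean score of every POS n-gram it occurs in.
    totals, counts = defaultdict(float), Counter()
    for doc in tagged_docs:
        for start, gram in pos_ngrams(doc, n):
            for term, _ in doc[start:start + n]:
                totals[term] += info[gram]
                counts[term] += 1
    return {term: totals[term] / counts[term] for term in totals}


def weighted_tf_idf(query_terms, doc_terms, idf, pos_weight):
    """Score a document as the sum of tf * idf * POS weight over query terms.

    ASSUMPTION: the POS weight is combined multiplicatively with TF-IDF.
    """
    tf = Counter(doc_terms)
    return sum(tf[t] * idf.get(t, 0.0) * pos_weight.get(t, 0.0) for t in query_terms)


# Toy corpus of (term, POS tag) pairs, for illustration only.
TAGGED_DOCS = [
    [("the", "DT"), ("cat", "NN"), ("sat", "VB")],
    [("the", "DT"), ("dog", "NN"), ("ran", "VB")],
    [("retrieval", "NN"), ("models", "NN")],
]
WEIGHTS = pos_term_weights(TAGGED_DOCS)
```

In this toy corpus, the terms in the rarer noun-noun context ("retrieval", "models") receive a higher weight than "the", which only occurs in the more common determiner-noun context. A real setup would estimate the POS n-gram statistics from a large tagged corpus rather than from the retrieval collection's few documents.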

