CL LG NE · Dec 11, 2015

Words are not Equal: Graded Weighting Model for building Composite Document Vectors

arXiv:1512.03549v114.322 citations

Originality: Incremental advance

AI Analysis

This work addresses the challenge of building composite document vectors for tasks like sentiment classification, offering incremental improvements with potential benefits for under-resourced languages.

The paper composes document vectors from word vectors using graded weighting schemes based on each word's discriminatory relevance (e.g., idf). Some of these schemes, particularly tf-idf, significantly improve on the prior state of the art. An ensemble with alternate classifiers such as an RNN model yields a 1.6% improvement on the IMDB movie review dataset and a 7.01% improvement on Amazon product reviews, and the language-free models show a 12% gain over earlier results on two review datasets.

Despite the success of distributional semantics, composing phrases from word vectors remains an important challenge. Several methods have been tried for benchmark tasks such as sentiment classification, including word vector averaging, matrix-vector approaches based on parsing, and on-the-fly learning of paragraph vectors. Most models omit stop words from the composition. Instead of such a yes-no decision, we consider several graded schemes where words are weighted according to their discriminatory relevance with respect to their use in the document (e.g., idf). Some of these methods (particularly tf-idf) are seen to result in a significant improvement in performance over the prior state of the art. Further, combining such approaches into an ensemble based on alternate classifiers, such as the RNN model, results in a 1.6% performance improvement on the standard IMDB movie review dataset and a 7.01% improvement on Amazon product reviews. Since these are language-free models and can be obtained in an unsupervised manner, they are also of interest for under-resourced languages such as Hindi, among many others. We demonstrate the language-free aspects by showing a gain of 12% for two review datasets over earlier results, and also release a new, larger dataset for future testing (Singh, 2015).
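The graded weighting idea in the abstract can be illustrated with a minimal sketch: each word vector is scaled by a tf-idf weight before averaging, so frequent, uninformative words are down-weighted rather than dropped outright. Everything below is an assumption for illustration, not the paper's implementation: the function names `idf_table` and `weighted_doc_vector`, the smoothed idf formula, the toy corpus, and the two-dimensional word vectors are all hypothetical.

```python
import math
from collections import Counter


def idf_table(docs):
    """Smoothed idf per token (one common variant; assumed, not from the paper)."""
    n_docs = len(docs)
    # Count each token once per document it appears in.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()}


def weighted_doc_vector(doc, word_vecs, idf, dim):
    """Average the word vectors of `doc`, weighting each word by tf * idf."""
    # Only tokens with both a vector and an idf value contribute.
    tf = Counter(tok for tok in doc if tok in word_vecs and tok in idf)
    total = [0.0] * dim
    weight_sum = 0.0
    for tok, count in tf.items():
        weight = count * idf[tok]
        weight_sum += weight
        for i, x in enumerate(word_vecs[tok]):
            total[i] += weight * x
    if weight_sum == 0.0:
        return total  # no known tokens: return the zero vector
    return [x / weight_sum for x in total]


# Toy corpus and toy 2-d word vectors (hypothetical values).
docs = [
    ["the", "movie", "was", "great"],
    ["the", "movie", "was", "awful"],
    ["the", "plot", "was", "dull"],
]
word_vecs = {"the": [1.0, 0.0], "great": [0.0, 1.0]}

idf = idf_table(docs)
vec = weighted_doc_vector(docs[0], word_vecs, idf, dim=2)
print(vec)
```

In this toy setup "the" occurs in every document and "great" in only one, so the weighted vector leans toward the "great" direction, whereas a plain average would give both words equal influence.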
