CL · Aug 10, 2015

Measuring Word Significance using Distributed Representations of Words

Adriaan M. J. Schakel, Benjamin J. Wilson

arXiv:1508.02297v1 · 101 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses word significance analysis for researchers in natural language processing. It is incremental, since it builds on existing word2vec methods.

The paper tackles the problem of measuring word significance in text corpora. It proposes using the length of word2vec word vectors, combined with term frequency, as the measure, and supports the proposal with experiments on a domain-specific corpus of abstracts. The result is a visualization technique that ranks words by significance.

Distributed representations of words as real-valued vectors in a relatively low-dimensional space aim at extracting syntactic and semantic features from large text corpora. A recently introduced neural network, named word2vec (Mikolov et al., 2013a; Mikolov et al., 2013b), was shown to encode semantic information in the direction of the word vectors. In this brief report, it is proposed to use the length of the vectors, together with the term frequency, as a measure of word significance in a corpus. Experimental evidence using a domain-specific corpus of abstracts is presented to support this proposal. A useful visualization technique for text corpora emerges, where words are mapped onto a two-dimensional plane and automatically ranked by significance.
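
The idea in the abstract can be illustrated with a minimal Python sketch. It assumes the two-dimensional plane is (term frequency, vector length) and that words are compared within bands of similar frequency; the abstract does not spell out either detail. The toy vectors, counts, function names, and the log2 banding are hypothetical and chosen only for illustration, not taken from the paper.

```python
import math
from collections import defaultdict


def vector_length(vec):
    """Euclidean (L2) norm of a word vector."""
    return math.sqrt(sum(x * x for x in vec))


def significance_points(vectors, counts):
    """Map each word to a 2-D point: (term frequency, vector length)."""
    return {
        word: (counts[word], vector_length(vec))
        for word, vec in vectors.items()
        if word in counts
    }


def rank_within_frequency_bands(points):
    """Group words into log2 frequency bands (an assumption of this sketch)
    and sort each band by descending vector length."""
    bands = defaultdict(list)
    for word, (tf, length) in points.items():
        if tf < 1:
            continue  # skip words that never occur
        bands[int(math.log2(tf))].append((word, length))
    return {
        band: [w for w, _ in sorted(items, key=lambda it: it[1], reverse=True)]
        for band, items in bands.items()
    }


# Hypothetical toy data; real vectors would come from a trained word2vec model.
toy_vectors = {
    "the": [0.1, 0.1],
    "of": [0.2, 0.1],
    "quark": [3.0, 4.0],
    "result": [1.0, 1.0],
}
toy_counts = {"the": 1000, "of": 900, "quark": 40, "result": 45}

points = significance_points(toy_vectors, toy_counts)
ranking = rank_within_frequency_bands(points)
```

Plotting `points` as a scatter plot would give one possible version of the two-dimensional map described above, with `ranking` listing the longest vectors first in each frequency band.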
