CL AI DC

Jun 10, 2016

WordNet2Vec: Corpora Agnostic Word Vectorization Method

Roman Bartusiak, Łukasz Augustyniak, Tomasz Kajdanowicz, Przemysław Kazienko, Maciej Piasecki

arXiv:1606.03335v14.821 citations

Originality: Incremental advance

AI Analysis

This work addresses the need for corpora-agnostic word vectorization methods for tasks like classification and clustering, but it appears incremental as it builds on existing WordNet resources.

The paper tackles the problem of structuring textual content in big data by proposing WordNet2Vec, a method that builds a vector for each word in WordNet to represent that word's role in natural language. It demonstrates the method's usefulness on sentiment analysis, framed as classification with transfer learning, using the Amazon opinion dataset.

The complex nature of big data resources demands new methods for structuring them, especially for textual content. WordNet is a good knowledge source for a comprehensive abstraction of natural language, since good implementations of it exist for many languages. Because WordNet embeds natural language in the form of a complex network, the paper proposes a transformation mechanism, WordNet2Vec, which creates a vector for each word in WordNet. Each vector encapsulates the general position, or role, of a given word with respect to all other words in the natural language. Any list or set of such vectors carries knowledge about the context of its components within the whole language. Such a word representation can be easily applied to many analytic tasks, such as classification or clustering. The usefulness of the WordNet2Vec method was demonstrated in sentiment analysis, i.e. classification with transfer learning on a real Amazon opinion textual dataset.
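
To make the idea of a vector that encodes a word's position relative to all other words concrete, here is a minimal sketch. It assumes, purely for illustration, that "position" is measured as the shortest-path distance between words in an undirected lexical network. The abstract does not state which measure is used. The toy edge list, the function names, and the `unreachable` placeholder are hypothetical and stand in for a real WordNet graph.

```python
from collections import deque

# Hypothetical toy lexical network; a real run would load WordNet relations instead.
TOY_EDGES = [
    ("good", "great"),
    ("great", "excellent"),
    ("good", "bad"),
    ("bad", "awful"),
    ("awful", "terrible"),
]


def build_adjacency(edges):
    """Build an undirected adjacency map from (word, word) pairs."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def bfs_distances(adjacency, source):
    """Return hop counts from source to every reachable word (breadth-first search)."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances


def word_vectors(edges, unreachable=-1):
    """Map each word to a vector of distances to all words in a fixed, sorted order.

    Assumption: shortest-path distance stands in for a word's "position";
    unreachable words get the placeholder value `unreachable`.
    """
    adjacency = build_adjacency(edges)
    vocabulary = sorted(adjacency)
    vectors = {}
    for word in vocabulary:
        distances = bfs_distances(adjacency, word)
        vectors[word] = [distances.get(other, unreachable) for other in vocabulary]
    return vocabulary, vectors


vocabulary, vectors = word_vectors(TOY_EDGES)
for word in vocabulary:
    print(word, vectors[word])
```

Each vector has one component per word in the vocabulary, so the representation depends only on the network and not on any training corpus. A document could then be represented by, for example, aggregating the vectors of its words before passing them to a classifier. That aggregation step is also an assumption here, not something the abstract specifies.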

