ML CL LG Jun 8, 2018

Text Classification based on Word Subspace with Term-Frequency

Erica K. Shimomoto, Lincon S. Souza, Bernardo B. Gatto, Kazuhiro Fukui

arXiv:1806.03125v13.516 citations h-index: 20

Originality: Incremental advance

AI Analysis

This is an incremental improvement for text classification tasks, addressing semantic representation issues in existing methods.

The paper tackles the lack of semantic meaning in bag-of-words features for text classification by proposing the word subspace, a concept for modeling a text from its word vectors. It reports results on the Reuters database that are competitive with state-of-the-art algorithms.

Text classification has become indispensable due to the rapid increase of text in digital form. Over the past three decades, efforts have been made to approach this task using various learning algorithms and statistical models based on bag-of-words (BOW) features. Despite their simple implementation, BOW features do not represent semantic meaning. To solve this problem, neural networks started to be employed to learn word vectors, as in word2vec. Word2vec embeds the semantic structure of words into vectors, where the angle between two vectors indicates the semantic similarity between the corresponding words. To measure the similarity between texts, we propose the novel concept of the word subspace, which can represent the intrinsic variability of features in a set of word vectors. Through this concept, it is possible to model a text from its word vectors while retaining semantic information. To incorporate word frequency directly in the subspace model, we further extend the word subspace to the term-frequency (TF) weighted word subspace. Based on these new concepts, text classification can be performed under the mutual subspace method (MSM) framework. The validity of our modeling is shown through experiments on the Reuters text database, comparing the results with various state-of-the-art algorithms.
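The abstract gives no formulas, so the following is only a minimal sketch of how a word subspace, its TF-weighted variant, and an MSM-style comparison might be computed. It assumes the subspace is spanned by the leading eigenvectors of the (optionally weighted) autocorrelation matrix of a text's word vectors, and that two subspaces are compared through the mean squared cosine of their canonical angles. The function names `word_subspace` and `subspace_similarity`, the `weights` parameter, and the toy 4-dimensional vectors are hypothetical and are not taken from the paper.

```python
import numpy as np


def word_subspace(vectors, dim, weights=None):
    """Return a (d x dim) orthonormal basis for a set of word vectors.

    Assumption: the basis is formed by the top `dim` eigenvectors of the
    autocorrelation matrix (no mean centering). Passing term frequencies
    as `weights` gives a TF-weighted variant of the same construction.
    """
    X = np.asarray(vectors, dtype=float).T          # shape (d, n): one column per word
    n = X.shape[1]
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    R = (X * w) @ X.T / w.sum()                     # weighted autocorrelation, shape (d, d)
    eigvals, eigvecs = np.linalg.eigh(R)            # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:dim]         # indices of the largest eigenvalues
    return eigvecs[:, order]


def subspace_similarity(U, V):
    """Mean squared cosine of the canonical angles between two subspaces.

    The singular values of U^T V are the cosines of the canonical angles
    when U and V have orthonormal columns.
    """
    cosines = np.linalg.svd(U.T @ V, compute_uv=False)
    cosines = np.clip(cosines, 0.0, 1.0)            # guard against rounding slightly above 1
    return float(np.mean(cosines ** 2))


# Toy word vectors (hypothetical, 4-dimensional) for two short texts.
text_a = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 0, 0]]
text_b = [[0, 0, 1, 0], [0, 0, 0, 1]]

basis_a = word_subspace(text_a, dim=2)
basis_b = word_subspace(text_b, dim=2)
print(subspace_similarity(basis_a, basis_a))
print(subspace_similarity(basis_a, basis_b))
```

Under these assumptions, a classifier in the MSM style would build one subspace per class from training texts and assign a query text to the class whose subspace is most similar to the query's subspace.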
