LG ML, Aug 28, 2020

An Intelligent CNN-VAE Text Representation Technology Based on Text Semantics for Comprehensive Big Data

Genggeng Liu, Canyang Guo, Lin Xie, Wenxi Liu, Naixue Xiong, Guolong Chen

arXiv:2008.12522v11.2

Originality: Synthesis-oriented

AI Analysis

This work addresses text representation challenges in NLP for big data applications, but it is incremental because it combines existing methods such as CNN and VAE.

The paper tackles the problem of extracting semantic features and distinguishing polysemy in text representation for big data. It proposes a CNN-VAE model that takes the output of an improved word2vec model as input, and it reports superior performance on text classification tasks with the KNN, RF, and SVM algorithms.

In the era of big data, the large volume of text generated on the Internet has given rise to a variety of text representation methods. In natural language processing (NLP), text representation transforms text into vectors that a computer can process without losing the original semantic information. However, existing methods struggle to extract the semantic features among words and to distinguish polysemy. Therefore, a text feature representation model based on a convolutional neural network (CNN) and a variational autoencoder (VAE) is proposed to extract text features, and the resulting representation is applied to text classification tasks. The CNN extracts features from the text vectors to capture the semantics among words, and the VAE is introduced to make the text feature space more consistent with a Gaussian distribution. In addition, the output of the improved word2vec model is used as the input of the proposed model to distinguish the different meanings of the same word in different contexts. Experimental results show that the proposed model achieves superior performance with k-nearest neighbor (KNN), random forest (RF), and support vector machine (SVM) classifiers.
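The two building blocks named in the abstract can be illustrated with a minimal, standard-library Python sketch. This is not the authors' implementation: the function names, the single-filter convolution with ReLU and max-over-time pooling, and the diagonal-Gaussian latent space are all assumptions chosen for illustration. The abstract does not specify the architecture, the loss, or how the improved word2vec vectors are produced.

```python
import math
import random


def conv1d_max_pool(embeddings, kernel, bias=0.0):
    """Slide one convolution filter over a sequence of word vectors and
    keep the strongest ReLU response (max-over-time pooling).

    embeddings: list of word vectors (hypothetical stand-in for word2vec output)
    kernel:     list of weight rows, one row per word the filter spans
    """
    width = len(kernel)
    if len(embeddings) < width:
        raise ValueError("sequence is shorter than the filter width")
    responses = []
    for start in range(len(embeddings) - width + 1):
        window = embeddings[start:start + width]
        total = bias
        for word_vec, kernel_row in zip(window, kernel):
            total += sum(x * w for x, w in zip(word_vec, kernel_row))
        responses.append(max(0.0, total))  # ReLU activation
    return max(responses)


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps, with eps drawn from N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL divergence from N(mu, diag(exp(log_var))) to N(0, I).

    Minimizing this term is the usual way a VAE pulls its latent
    space toward a standard Gaussian.
    """
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))


if __name__ == "__main__":
    # Toy 3-word "sentence" with 2-dimensional embeddings (made-up values).
    sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    bigram_filter = [[1.0, 0.0], [0.0, 1.0]]
    feature = conv1d_max_pool(sentence, bigram_filter)

    # Assumption: the pooled feature is used directly as the latent mean.
    mu, log_var = [feature], [0.0]
    z = reparameterize(mu, log_var, random.Random(0))
    print(feature, z, kl_to_standard_normal(mu, log_var))
```

In a full model the pooled CNN features would pass through learned layers that produce `mu` and `log_var`, and the sampled vector `z` would serve as the text representation fed to a downstream classifier such as KNN, RF, or SVM.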
