CL AI LG Sep 6, 2018

An Analysis of Hierarchical Text Classification Using Word Embeddings

Roger A. Stein, Patricia A. Jaques, Joao F. Valiati

arXiv:1809.01771v13.7228 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses hierarchical text classification, a specific domain problem, and is incremental as it applies existing methods to a new context.

The study investigated the application of word embeddings and machine learning algorithms to hierarchical text classification, finding that FastText achieved an LCA F1 score of 0.893 on the RCV1 dataset, indicating promise for this approach.

Efficient distributed numerical word representation models (word embeddings) combined with modern machine learning algorithms have recently yielded considerable improvements on automatic document classification tasks. However, the effectiveness of such techniques has not yet been assessed for hierarchical text classification (HTC). This study investigates the application of those models and algorithms to this specific problem by means of experimentation and analysis. We trained classification models with prominent machine learning algorithm implementations (fastText, XGBoost, SVM, and Keras' CNN) and notable word embedding generation methods (GloVe, word2vec, and fastText) on publicly available data and evaluated them with measures specifically suited to the hierarchical context. FastText achieved an ${}_{LCA}F_1$ of 0.893 on a single-labeled version of the RCV1 dataset. The analysis indicates that using word embeddings and their variants is a very promising approach for HTC.
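To illustrate why hierarchy-aware measures differ from flat ones, here is a minimal sketch of a micro-averaged, ancestor-augmented hierarchical F1. It is a simpler relative of the ${}_{LCA}F_1$ reported above, not that measure itself, and it assumes single-label predictions on a tree. The toy hierarchy, label names, and function names are hypothetical and do not come from the paper or from RCV1.

```python
from typing import Dict, List, Optional, Set

# Hypothetical toy hierarchy: child -> parent (None marks a top-level category).
PARENT: Dict[str, Optional[str]] = {
    "economics": None,
    "markets": None,
    "equity_markets": "markets",
    "bond_markets": "markets",
}


def with_ancestors(label: str, parent: Dict[str, Optional[str]]) -> Set[str]:
    """Return the label together with all of its ancestors in the tree."""
    nodes: Set[str] = set()
    current: Optional[str] = label
    while current is not None:
        nodes.add(current)
        current = parent[current]
    return nodes


def hierarchical_f1(
    true_labels: List[str],
    pred_labels: List[str],
    parent: Dict[str, Optional[str]],
) -> float:
    """Micro-averaged F1 over ancestor-augmented label sets (single-label case)."""
    overlap = 0
    true_total = 0
    pred_total = 0
    for t, p in zip(true_labels, pred_labels):
        t_set = with_ancestors(t, parent)
        p_set = with_ancestors(p, parent)
        overlap += len(t_set & p_set)  # nodes shared by truth and prediction
        true_total += len(t_set)
        pred_total += len(p_set)
    if overlap == 0:
        return 0.0
    precision = overlap / pred_total
    recall = overlap / true_total
    return 2 * precision * recall / (precision + recall)


# Predicting a sibling category earns partial credit for the shared parent.
score = hierarchical_f1(
    ["equity_markets", "economics"],
    ["bond_markets", "economics"],
    PARENT,
)
```

In this toy case a flat accuracy would count the first prediction as entirely wrong, whereas the ancestor-augmented score credits the shared parent `markets`. LCA-based measures refine this idea by restricting the augmented sets around the lowest common ancestor so that long shared ancestor chains are not over-counted.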
