CL · Apr 3, 2018

Incorporating Word Embeddings into Open Directory Project based Large-scale Classification

Kang-Min Kim, Aliyeva Dinara, Byung-Ju Choi, SangKeun Lee

arXiv:1804.00828v1 · 0.31 citations

Originality: Incremental advance

AI Analysis

This paper addresses the limited performance of large-scale text classification in applications that rely on knowledge bases such as ODP. The contribution is incremental, since it combines existing approaches.

The paper tackles the challenge of large-scale text classification by integrating word embeddings into Open Directory Project (ODP)-based methods, resulting in improvements of 10% in macro-averaging F1-score and 28% in precision at k over state-of-the-art techniques.
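For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how macro-averaged F1 and precision at k are commonly computed. It is not the authors' evaluation code; the function names `macro_f1` and `precision_at_k` and the toy labels are assumptions made for illustration, and the label set is taken as the union of true and predicted labels, which is one common convention.

```python
from collections import defaultdict


def macro_f1(y_true, y_pred):
    """Macro-averaged F1: per-class F1, then an unweighted mean over classes."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked categories that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for item in ranked[:k] if item in relevant) / k


# Toy data (hypothetical): three classes, one misclassification.
toy_f1 = macro_f1(["a", "a", "b", "c"], ["a", "b", "b", "c"])
toy_p_at_2 = precision_at_k(["x", "y", "z"], {"x", "z"}, 2)
```

Because macro averaging weights every class equally, rare categories count as much as frequent ones, which matters in a taxonomy as large and skewed as ODP.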

Recently, implicit representation models, such as embedding or deep learning, have been successfully adopted for text classification tasks due to their outstanding performance. However, these approaches are limited to small- or moderate-scale text classification. Explicit representation models are often used in large-scale text classification, such as Open Directory Project (ODP)-based text classification. However, the performance of these models is limited by the associated knowledge bases. In this paper, we incorporate word embeddings into ODP-based large-scale classification. To this end, we first generate category vectors, which represent the semantics of ODP categories, by jointly modeling word embeddings and ODP-based text classification. We then propose a novel semantic similarity measure, which utilizes the category vectors obtained from the joint model and the word vectors obtained from word embeddings. The evaluation results clearly show the efficacy of our methodology in large-scale text classification. The proposed scheme exhibits significant improvements of 10% and 28% in terms of macro-averaging F1-score and precision at k, respectively, over state-of-the-art techniques.
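The abstract does not spell out the proposed similarity measure, so the sketch below only illustrates the general idea of scoring a text against category vectors that live in the same space as word vectors. It swaps in plain cosine similarity over an averaged document vector, which is an assumption and not the paper's measure. The names `rank_categories`, `word_vecs`, and `category_vecs`, and all vector values, are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def rank_categories(tokens, word_vecs, category_vecs):
    """Rank categories by cosine similarity to the averaged word vectors."""
    known = [word_vecs[t] for t in tokens if t in word_vecs]
    if not known:
        return []  # no in-vocabulary words, nothing to score
    doc_vec = mean_vector(known)
    scored = [(cat, cosine(doc_vec, vec)) for cat, vec in category_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 2-dimensional vectors (hypothetical values, for illustration only).
word_vecs = {"goal": [1.0, 0.0], "match": [0.8, 0.2], "stock": [0.0, 1.0]}
category_vecs = {
    "Sports/Soccer": [0.9, 0.1],
    "Business/Investing": [0.1, 0.9],
}
ranking = rank_categories(["goal", "match", "unknown"], word_vecs, category_vecs)
```

In this toy setup the sports category ranks first for a text containing "goal" and "match". A real system would learn the category vectors jointly with the ODP classifier, as the abstract describes, rather than setting them by hand.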
