LG CL IR
Jun 24, 2017

Semi-supervised Text Categorization Using Recursive K-means Clustering

Harsha S. Gowda, Mahamad Suhil, D. S. Guru, Lavanya Narayana Raju

arXiv:1706.07913v1
19 citations

Originality: Incremental advance

AI Analysis

This paper addresses text categorization, a common task in natural language processing, but appears incremental because it builds on existing clustering and nearest-neighbor techniques.

The paper tackles text document classification by proposing a semi-supervised method that uses recursive K-means clustering to label unlabeled data. The authors report better performance than recent state-of-the-art models on the 20Newsgroups dataset.

In this paper, we present a semi-supervised learning algorithm for the classification of text documents, together with a method for labeling unlabeled text documents. The method follows a divide-and-conquer strategy: it uses the recursive K-means algorithm to partition a collection containing both labeled and unlabeled documents. K-means is applied recursively to each partition until every partition contains labeled documents of a single class. Once the desired clusters are obtained, their centroids are taken as cluster representatives, and the nearest neighbor rule is used to classify an unknown text document. A series of experiments has been conducted to demonstrate the superiority of the proposed model over other recent state-of-the-art models on the 20Newsgroups dataset.
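The procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions that the abstract does not specify: documents are already represented as dense feature vectors (tuples of numbers), K = 2 at every level, K-means uses deterministic farthest-point seeding, clusters that contain no labeled document are dropped, and a majority vote is used when a cluster cannot be split further. The names `recursive_kmeans` and `classify` are hypothetical.

```python
from collections import Counter

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return tuple(sum(col) / n for col in zip(*vectors))

def kmeans(vectors, k, iters=20):
    """Plain Lloyd's K-means; returns one cluster index per vector.
    Seeding is farthest-point (an assumption, chosen for determinism)."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids)))
    assign = []
    for _ in range(iters):
        assign = [min(range(k), key=lambda j: sq_dist(v, centroids[j]))
                  for v in vectors]
        for j in range(k):
            members = [v for v, a in zip(vectors, assign) if a == j]
            if members:
                centroids[j] = mean(members)
    return assign

def recursive_kmeans(vectors, labels, k=2, depth=0, max_depth=10):
    """Split until each cluster holds labeled documents of one class.
    `labels` uses None for unlabeled documents.
    Returns a list of (centroid, class) pairs."""
    known = [l for l in labels if l is not None]
    if not known:
        return []  # assumption: clusters without labeled documents are dropped
    majority = Counter(known).most_common(1)[0][0]
    if len(set(known)) == 1 or depth == max_depth or len(set(vectors)) < k:
        # Pure cluster, or a fallback (assumption) when it cannot be split.
        return [(mean(vectors), majority)]
    assign = kmeans(vectors, k)
    if len(set(assign)) < 2:
        return [(mean(vectors), majority)]  # K-means failed to split
    model = []
    for j in range(k):
        idx = [i for i, a in enumerate(assign) if a == j]
        if idx:
            model += recursive_kmeans([vectors[i] for i in idx],
                                      [labels[i] for i in idx],
                                      k, depth + 1, max_depth)
    return model

def classify(x, model):
    """Nearest neighbor rule over the cluster centroids."""
    return min(model, key=lambda pair: sq_dist(x, pair[0]))[1]

# Toy data: three groups of 2-D points, one labeled document per group.
vectors = [(0, 0), (0, 1), (1, 0),
           (10, 10), (10, 11), (11, 10),
           (0, 10), (1, 10), (0, 11)]
labels = ["A", None, None, "B", None, None, "C", None, None]

model = recursive_kmeans(vectors, labels)
print(classify((9, 9), model))
```

In this toy run the first split leaves one cluster that mixes two labeled classes, so it is split again. That second split is the recursive step the abstract describes. Real text data would first need a vectorization step such as TF-IDF, which is outside this sketch.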
