Ontology Based Document Clustering Using MapReduce
This work addresses data-intensive document clustering for applications with large document collections. The contribution is incremental, since it builds on existing methods by adding ontology integration.
The paper tackles the problem of clustering large document collections by proposing a distributed bisecting k-means implementation using MapReduce and integrating the WordNet ontology to leverage semantic relations between words. The results show that using lexical categories for nouns reduces the number of document features from thousands to tens and improves internal evaluation measures.
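To illustrate why lexical categories shrink the feature space, here is a minimal Python sketch. WordNet groups noun senses into a few dozen lexicographer files such as noun.animal and noun.artifact, so counting categories instead of words caps the number of features. The table `TOY_LEXNAMES` and the function `lexical_category_features` are hypothetical names introduced for this example; the table is a hand-written stand-in for a real WordNet lookup, not the paper's code.

```python
from collections import Counter

# Hypothetical stand-in for a WordNet lookup: noun -> lexical category.
# A real pipeline would query WordNet rather than a hand-written table.
TOY_LEXNAMES = {
    "dog": "noun.animal",
    "cat": "noun.animal",
    "horse": "noun.animal",
    "car": "noun.artifact",
    "engine": "noun.artifact",
    "doctor": "noun.person",
}


def lexical_category_features(tokens, lexnames=TOY_LEXNAMES):
    """Count tokens per lexical category; tokens with no entry are dropped."""
    return Counter(lexnames[t] for t in tokens if t in lexnames)


doc = ["the", "dog", "chased", "the", "cat", "past", "the", "car"]
features = lexical_category_features(doc)
# Three distinct nouns collapse into two category features.
```

However large the vocabulary grows, the feature vector has at most one dimension per category, which is consistent with the reported drop from thousands of features to tens.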
Document clustering has become a data-intensive task because of the rapid growth in the number of available documents. In addition, the feature sets that represent those documents are very large. The most common representation is the vector space model, which treats a document as a bag of words and does not capture semantic relations between words. In this paper we introduce a distributed implementation of bisecting k-means using the MapReduce programming model, with the aim of clustering data-intensive document collections. We also propose integrating the WordNet ontology with bisecting k-means so that semantic relations between words can be used to improve document clustering results. Our experimental results show that, when only the lexical categories of nouns are used, the internal evaluation measures of document clustering improve and the number of document features drops from thousands to tens. Our experiments were conducted using Amazon Elastic MapReduce to deploy the bisecting k-means algorithm.
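The idea of expressing bisecting k-means as map and reduce steps can be sketched in a single Python process. This is an illustration under stated assumptions, not the paper's Elastic MapReduce implementation. The function names, the choice to split the largest cluster, the deterministic farthest-point initialization, and the fixed iteration count are all assumptions made for the example.

```python
from collections import defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_assign(point, centroids):
    """Map step: emit (index of nearest centroid, (point, 1))."""
    nearest = min(range(len(centroids)),
                  key=lambda i: sq_dist(point, centroids[i]))
    return nearest, (point, 1)


def reduce_mean(values):
    """Reduce step: average all points emitted under one centroid index."""
    total = [0.0] * len(values[0][0])
    count = 0
    for point, n in values:
        total = [t + x for t, x in zip(total, point)]
        count += n
    return [t / count for t in total]


def two_means(points, iterations=10):
    """Split points in two with k-means (k=2) written as map/reduce rounds."""
    # Assumption: seed with the first point and the point farthest from it.
    first = points[0]
    centroids = [first, max(points, key=lambda p: sq_dist(first, p))]
    for _ in range(iterations):
        groups = defaultdict(list)  # plays the role of the shuffle phase
        for p in points:
            key, value = map_assign(p, centroids)
            groups[key].append(value)
        centroids = [reduce_mean(groups[i]) if i in groups else centroids[i]
                     for i in range(2)]
    halves = [[], []]
    for p in points:
        halves[map_assign(p, centroids)[0]].append(p)
    return halves


def bisecting_kmeans(points, k):
    """Repeatedly bisect the largest cluster until k clusters exist."""
    clusters = [list(points)]
    while len(clusters) < k:
        largest = max(clusters, key=len)  # assumption: split by size
        if len(largest) < 2:
            break
        left, right = two_means(largest)
        if not left or not right:
            break  # no further split possible (e.g. duplicate points)
        clusters.remove(largest)
        clusters.extend([left, right])
    return clusters


points = [[0, 0], [0, 1], [10, 10], [10, 11], [20, 0], [20, 1]]
clusters = bisecting_kmeans(points, 3)
```

In a real MapReduce job, `map_assign` would run in parallel over input splits and `reduce_mean` would run once per centroid key; the `defaultdict` here only stands in for the framework's shuffle.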