Vec2GC -- A Graph Based Clustering Method for Text Representations
The work addresses the need for better unsupervised document processing in NLP, though the contribution appears incremental because it builds on existing graph-based and density-based clustering techniques.
The paper tackles unsupervised clustering of terms or documents in NLP pipelines with limited labeled data. It introduces Vec2GC, a density-based method that applies community detection to weighted graphs built from text representations and that also supports hierarchical clustering.
NLP pipelines with limited or no labeled data rely on unsupervised methods for document processing. Unsupervised approaches typically depend on clustering of terms or documents. In this paper, we introduce a novel clustering algorithm, Vec2GC (Vector to Graph Communities), an end-to-end pipeline to cluster terms or documents for any given text corpus. Our method uses community detection on a weighted graph of the terms or documents, created using text representation learning. The Vec2GC clustering algorithm is a density-based approach that also supports hierarchical clustering.
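The general idea of clustering by community detection on a similarity graph can be illustrated with a minimal sketch. It is not the Vec2GC implementation: the abstract does not specify the edge-weighting rule or the community detection algorithm, so this example assumes cosine similarity with a hypothetical cutoff (`threshold`) for edges and swaps in a simple deterministic weighted label propagation as the community detection step. The function names and the toy vectors are illustrative only, and the hierarchical aspect is not shown.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_graph(vectors, threshold=0.8):
    """Build a weighted graph: nodes are item indices, and an edge (i, j)
    with weight cosine(i, j) exists when the similarity reaches the
    threshold. The threshold value is an assumption of this sketch."""
    graph = {i: {} for i in range(len(vectors))}
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            sim = cosine(vectors[i], vectors[j])
            if sim >= threshold:
                graph[i][j] = sim
                graph[j][i] = sim
    return graph


def label_propagation(graph, max_iters=100):
    """Weighted label propagation, used here as a stand-in community
    detection step. Each node repeatedly adopts the label with the largest
    total edge weight among its neighbours; ties go to the smallest label."""
    labels = {node: node for node in graph}
    for _ in range(max_iters):
        changed = False
        for node in sorted(graph):
            if not graph[node]:
                continue  # isolated nodes keep their own label
            scores = defaultdict(float)
            for neighbour, weight in graph[node].items():
                scores[labels[neighbour]] += weight
            best = max(scores.values())
            new_label = min(l for l, s in scores.items() if s == best)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    communities = defaultdict(list)
    for node, label in labels.items():
        communities[label].append(node)
    return sorted(sorted(members) for members in communities.values())


# Toy "embeddings": two well-separated groups of 2-D vectors.
vectors = [
    [1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
    [0.0, 1.0], [0.1, 0.9], [0.2, 0.8],
]
clusters = label_propagation(build_graph(vectors, threshold=0.8))
print(clusters)  # [[0, 1, 2], [3, 4, 5]]
```

In practice the vectors would come from a learned text representation rather than hand-written values, and a modularity-based community detection routine from a graph library could replace the label propagation step.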