Self-supervised Document Clustering Based on BERT with Data Augment
This work addresses unsupervised and few-shot text clustering, reporting gains over prior unsupervised clustering methods on both short and long texts.
This paper proposes self-supervised contrastive learning (SCL) and few-shot contrastive learning (FCL) based on BERT for text clustering. SCL outperforms state-of-the-art unsupervised clustering for both short and long texts, and FCL with unsupervised data augmentation (UDA) further improves performance for short texts, approaching supervised learning results.
Contrastive learning is a promising approach to unsupervised learning, as it inherits the advantages of well-studied deep models without requiring a dedicated, complex model design. In this paper, building on bidirectional encoder representations from transformers (BERT), we propose self-supervised contrastive learning (SCL) as well as few-shot contrastive learning (FCL) with unsupervised data augmentation (UDA) for text clustering. SCL outperforms state-of-the-art unsupervised clustering approaches on both short and long texts in terms of several clustering evaluation measures. FCL achieves performance close to that of supervised learning, and FCL with UDA further improves performance on short texts.
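To make the contrastive objective concrete, the following is a minimal sketch of one common contrastive loss, NT-Xent (the normalized temperature-scaled cross-entropy popularized by SimCLR). The abstract does not state which loss SCL or FCL uses, so this is an illustrative stand-in and not the authors' formulation. The function name `nt_xent_loss`, the `temperature` value, and the use of plain NumPy arrays in place of BERT document embeddings are all assumptions made for the example.

```python
import numpy as np


def nt_xent_loss(view_a, view_b, temperature=0.5):
    """NT-Xent loss over two batches of embeddings (hypothetical helper).

    Row i of view_a and row i of view_b are treated as a positive pair,
    e.g. two augmented views of the same document. Every other row in the
    combined batch acts as a negative.
    """
    n = view_a.shape[0]
    z = np.concatenate([view_a, view_b], axis=0).astype(np.float64)
    # Unit-normalize so the dot product equals cosine similarity.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = (z @ z.T) / temperature
    # A sample must not be compared with itself.
    np.fill_diagonal(sim, -np.inf)
    # Index of the positive partner for each of the 2n rows.
    positives = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    # Numerically stable log-softmax over each row.
    row_max = sim.max(axis=1, keepdims=True)
    shifted = sim - row_max
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(2 * n), positives].mean())


# Toy usage: random vectors stand in for document embeddings (assumption).
rng = np.random.default_rng(0)
docs = rng.normal(size=(8, 16))
augmented = docs + 0.01 * rng.normal(size=(8, 16))
loss_matched = nt_xent_loss(docs, augmented)
loss_mismatched = nt_xent_loss(docs, augmented[::-1])
```

In this toy setup the loss is lower when each embedding is paired with its own augmented view than when the pairs are mismatched, which is the signal a contrastive encoder is trained on. In a few-shot variant, one plausible choice is to form positive pairs from documents that share a label, though the abstract does not confirm how FCL constructs its pairs.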