CL · AI · Feb 27, 2022

UCTopic: Unsupervised Contrastive Learning for Phrase Representations and Topic Mining

arXiv:2202.13469v1647 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for context-aware phrase representations in topic mining without extensive annotations. It appears incremental because it builds on existing contrastive learning, adding new components such as cluster-assisted negative selection.

The paper tackles the problem of learning high-quality phrase representations for topic mining. It proposes UCTopic, an unsupervised contrastive learning framework that outperforms the state-of-the-art phrase representation model by 38.2% NMI on average across four entity clustering tasks.

High-quality phrase representations are essential to finding topics and related terms in documents (a.k.a. topic mining). Existing phrase representation learning methods either simply combine unigram representations in a context-free manner or rely on extensive annotations to learn context-aware knowledge. In this paper, we propose UCTopic, a novel unsupervised contrastive learning framework for context-aware phrase representations and topic mining. UCTopic is pretrained at large scale to distinguish whether the contexts of two phrase mentions have the same semantics. The key to pretraining is positive pair construction from our phrase-oriented assumptions. However, we find that traditional in-batch negatives cause performance decay when finetuning on a dataset with a small number of topics. Hence, we propose cluster-assisted contrastive learning (CCL), which largely reduces noisy negatives by selecting negatives from clusters and further improves phrase representations for topics accordingly. UCTopic outperforms the state-of-the-art phrase representation model by 38.2% NMI on average on four entity clustering tasks. Comprehensive evaluation on topic mining shows that UCTopic can extract coherent and diverse topical phrases.
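
The idea of selecting negatives from clusters instead of from the batch can be illustrated with a small sketch. This is not the authors' implementation: the function names (`sample_cluster_negatives`, `info_nce`), the toy 2-D embeddings, the temperature value, and the use of a standard InfoNCE-style loss are all assumptions made for illustration. The cluster ids are assumed to come from some prior clustering step (for example k-means over phrase embeddings).

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def sample_cluster_negatives(anchor_idx, cluster_ids, k, rng):
    """Pick up to k indices whose (assumed) pseudo-label differs from the anchor's.

    In-batch negatives would take every other example, including ones that
    share the anchor's topic; restricting to other clusters avoids those.
    """
    anchor_cluster = cluster_ids[anchor_idx]
    candidates = [i for i, c in enumerate(cluster_ids) if c != anchor_cluster]
    return rng.sample(candidates, min(k, len(candidates)))

def info_nce(anchor, positive, negatives, temperature=0.05):
    """InfoNCE-style loss: -log softmax score of the positive among all candidates."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denom - logits[0]

# Toy phrase embeddings and hypothetical cluster assignments.
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9], [-1.0, 0.1]]
cluster_ids = [0, 0, 1, 1, 2]

rng = random.Random(0)
neg_idx = sample_cluster_negatives(0, cluster_ids, k=2, rng=rng)
clean_loss = info_nce(embeddings[0], embeddings[1],
                      [embeddings[i] for i in neg_idx])

# A "noisy" negative: a same-topic phrase wrongly treated as a negative.
noisy_loss = info_nce(embeddings[0], embeddings[1], [embeddings[1]])
```

In this toy setup the loss with a same-cluster negative is larger than the loss with cross-cluster negatives, which mirrors the abstract's point that noisy negatives hurt when the number of topics is small.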

Code Implementations: 1 repo