CL LG ML | Feb 12, 2025

Summaries as Centroids for Interpretable and Scalable Text Clustering

arXiv:2502.09667v3 | 14.78 citations | h-index: 1

Originality: Incremental advance

AI Analysis

This work addresses the need for human-readable, auditable cluster prototypes in text clustering and offers a scalable solution for applications such as streaming text analysis. The advance is incremental: it builds on k-means, adding summarization-based centroids.

The paper makes text clustering more interpretable and scalable by introducing k-NLPmeans and k-LLMmeans, which periodically replace numeric centroids with textual summaries. The authors report that both methods consistently outperform classical baselines and approach the accuracy of recent LLM-based methods without extensive LLM calls.

We introduce k-NLPmeans and k-LLMmeans, text-clustering variants of k-means that periodically replace numeric centroids with textual summaries. The key idea, summary-as-centroid, retains k-means assignments in embedding space while producing human-readable, auditable cluster prototypes. The method is LLM-optional: k-NLPmeans uses lightweight, deterministic summarizers, enabling offline, low-cost, and stable operation; k-LLMmeans is a drop-in upgrade that uses an LLM for summaries under a fixed per-iteration budget whose cost does not grow with dataset size. We also present a mini-batch extension for real-time clustering of streaming text. Across diverse datasets, embedding models, and summarization strategies, our approach consistently outperforms classical baselines and approaches the accuracy of recent LLM-based clustering, without extensive LLM calls. Finally, we provide a case study on sequential text streams and release a StackExchange-derived benchmark for evaluating streaming text clustering.
