CL · AI · SI · May 30, 2025

Hierarchical Level-Wise News Article Clustering via Multilingual Matryoshka Embeddings

arXiv:2506.00277v1 · 3 citations · h-index: 37 · ACL
Originality: Incremental advance
AI Analysis

This paper addresses the need for scalable, interpretable clustering of multilingual news and social media data, though it appears incremental in that it combines existing embedding and clustering ideas.

The paper tackles the poor scalability, opaque similarity metrics, and weak multilingual performance of existing approaches to clustering news articles by introducing a hierarchical, multilingual approach built on Matryoshka embeddings. Its embedding model achieves state-of-the-art performance on the SemEval 2022 Task 8 test dataset, with a Pearson ρ of 0.816.

Contextual large language model embeddings are increasingly utilized for topic modeling and clustering. However, current methods often scale poorly, rely on opaque similarity metrics, and struggle in multilingual settings. In this work, we present a novel, scalable, interpretable, hierarchical, and multilingual approach to clustering news articles and social media data. To do this, we first train multilingual Matryoshka embeddings that can determine story similarity at varying levels of granularity based on which subset of the dimensions of the embeddings is examined. This embedding model achieves state-of-the-art performance on the SemEval 2022 Task 8 test dataset (Pearson $ρ$ = 0.816). Once the model is trained, we develop an efficient hierarchical clustering algorithm that leverages the hierarchical nature of Matryoshka embeddings to identify unique news stories, narratives, and themes. We conclude by illustrating how our approach can identify and cluster stories, narratives, and overarching themes within real-world news datasets.
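
The idea of reading similarity at different granularities from different subsets of the embedding dimensions can be illustrated with a minimal Python sketch. This is not the paper's algorithm: it assumes that shorter prefixes of a vector capture coarser similarity (themes) and longer prefixes capture finer similarity (stories), it uses hand-made 4-dimensional toy vectors in place of trained Matryoshka embeddings, and the helper names (`prefix_cosine`, `threshold_clusters`, `level_wise_clusters`), the thresholds, and the simple single-link grouping are all hypothetical choices made for illustration.

```python
import numpy as np


def prefix_cosine(a, b, dims):
    """Cosine similarity computed on the first `dims` dimensions only."""
    a, b = a[:dims], b[:dims]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def threshold_clusters(vectors, dims, threshold):
    """Single-link grouping: join any pair whose prefix similarity >= threshold."""
    n = len(vectors)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if prefix_cosine(vectors[i], vectors[j], dims) >= threshold:
                parent[find(j)] = find(i)
    return [find(i) for i in range(n)]


def level_wise_clusters(vectors, levels):
    """Cluster coarse-to-fine; `levels` is a list of (dims, threshold) pairs.

    Returns one tuple per item giving its cluster id at each level,
    where ids at a finer level are local to the parent cluster.
    """
    vectors = np.asarray(vectors, dtype=float)
    groups = [list(range(len(vectors)))]
    paths = [() for _ in vectors]
    for dims, threshold in levels:
        next_groups = []
        for members in groups:
            # Only compare items that already share a coarser cluster.
            labels = threshold_clusters(vectors[members], dims, threshold)
            buckets = {}
            for idx, label in zip(members, labels):
                buckets.setdefault(label, []).append(idx)
            for local_id, bucket in enumerate(buckets.values()):
                for idx in bucket:
                    paths[idx] = paths[idx] + (local_id,)
                next_groups.append(bucket)
        groups = next_groups
    return paths


# Toy data (assumption): dims 0-1 encode a "theme", dims 2-3 encode a "story".
toy = [
    [1.0, 0.0, 1.0, 0.0],   # theme A, story 1
    [1.0, 0.0, 1.0, 0.1],   # theme A, story 1 (near duplicate)
    [1.0, 0.0, -1.0, 0.0],  # theme A, story 2
    [0.0, 1.0, 0.0, 1.0],   # theme B, story 3
]
paths = level_wise_clusters(toy, [(2, 0.9), (4, 0.9)])
print(paths)
```

Because each finer level only compares items that already share a coarser cluster, the number of pairwise comparisons shrinks as the hierarchy deepens, which is one plausible reading of why a level-wise scheme can scale better than flat clustering. A real pipeline would also need to handle zero-norm prefixes, which this sketch does not.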

