CL · AI · Jan 5, 2024

German Text Embedding Clustering Benchmark

arXiv:2401.02709v1 · 106 citations · h-index: 4 · KONVENS
Originality: Synthesis-oriented
AI Analysis

This work addresses a gap in German NLP resources for clustering tasks, but it is incremental, building on existing methods and benchmarks.

The authors address the lack of German resources for clustering text embeddings by introducing a benchmark that assesses performance across domains. They find that reducing embedding dimensions can improve clustering, and their experiments suggest that continued pre-training of German BERT models can yield significant gains on short texts.

This work introduces a benchmark assessing the performance of clustering German text embeddings in different domains. This benchmark is driven by the increasing use of clustering neural text embeddings in tasks that require the grouping of texts (such as topic modeling) and the need for German resources in existing benchmarks. We provide an initial analysis for a range of pre-trained mono- and multilingual models evaluated on the outcome of different clustering algorithms. Results include strong-performing mono- and multilingual models. Reducing the dimensions of embeddings can further improve clustering. Additionally, we conduct experiments with continued pre-training for German BERT models to estimate the benefits of this additional training. Our experiments suggest that significant performance improvements are possible for short text. All code and datasets are publicly available.
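Evaluating "the outcome of different clustering algorithms" requires a score that compares predicted clusters against gold labels. The abstract does not name the metric, so the following is only a minimal sketch that assumes V-measure (the harmonic mean of homogeneity and completeness), a common choice for clustering benchmarks. The function names and the toy German labels are hypothetical and are not taken from the paper's code.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a sequence of labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given) for two equal-length label sequences."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum(
        (c / n) * math.log(c / given_counts[g]) for (g, _), c in joint.items()
    )


def v_measure(gold, predicted):
    """V-measure in [0, 1]; assumed metric, not confirmed by the source."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equally long")
    h_gold, h_pred = entropy(gold), entropy(predicted)
    # Homogeneity: each cluster contains members of a single gold class.
    homogeneity = (
        1.0 if h_gold == 0 else 1.0 - conditional_entropy(gold, predicted) / h_gold
    )
    # Completeness: all members of a gold class land in the same cluster.
    completeness = (
        1.0 if h_pred == 0 else 1.0 - conditional_entropy(predicted, gold) / h_pred
    )
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


if __name__ == "__main__":
    # Toy gold topics and cluster ids; illustrative values only.
    gold = ["sport", "sport", "politik", "politik"]
    print(v_measure(gold, [1, 1, 0, 0]))  # clusters match the topics
    print(v_measure(gold, [0, 0, 0, 0]))  # everything in one cluster
    print(v_measure(gold, [0, 1, 2, 3]))  # every text in its own cluster
```

The score is invariant to how cluster ids are named, so it can compare the output of any clustering algorithm against the same gold labels, with or without a preceding dimensionality-reduction step.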

Code Implementations: 1 repo