CL · AI · Sep 26, 2025

KurdSTS: The Kurdish Semantic Textual Similarity

arXiv:2510.02336v1 · h-index: 7
Originality: Synthesis-oriented
AI Analysis

This work addresses a gap in NLP for low-resource languages, specifically Kurdish, by providing a foundational dataset and benchmarks, though it is incremental as it applies existing methods to new data.

The authors address the lack of semantic textual similarity resources for Kurdish by creating the first Kurdish STS dataset, 10,000 annotated sentence pairs, and by benchmarking models such as Sentence-BERT and multilingual BERT on it, which obtain competitive results.

Semantic Textual Similarity (STS) measures the degree of meaning overlap between two texts and underpins many NLP tasks. While extensive resources exist for high-resource languages, low-resource languages such as Kurdish remain underserved. We present, to our knowledge, the first Kurdish STS dataset: 10,000 sentence pairs spanning formal and informal registers, each annotated for similarity. We benchmark Sentence-BERT, multilingual BERT, and other strong baselines, obtaining competitive results while highlighting challenges arising from Kurdish morphology, orthographic variation, and code-mixing. The dataset and baselines establish a reproducible evaluation suite and provide a strong starting point for future research on Kurdish semantics and low-resource NLP.
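
As a rough illustration of how an STS benchmark of this kind is commonly scored, the sketch below compares cosine similarities between sentence embeddings against human similarity labels using Pearson correlation. The abstract does not state the paper's metric, label scale, or embedding setup, so the toy vectors, the 0 to 5 gold scale, and the choice of Pearson correlation are all assumptions made for illustration, not the authors' protocol.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy stand-ins for sentence embeddings (hypothetical; a real run would
# obtain these from a model such as Sentence-BERT).
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # same direction
    ([1.0, 0.0], [1.0, 1.0]),  # 45 degrees apart
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal
]
gold = [5.0, 3.0, 0.0]  # hypothetical human labels on an assumed 0-5 scale

predicted = [cosine_similarity(u, v) for u, v in pairs]
score = pearson(predicted, gold)
print(f"Pearson correlation with gold labels: {score:.3f}")
```

A higher correlation means the model's similarity ranking tracks the human annotations more closely; Spearman rank correlation is a common alternative when only the ordering matters.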

