CL Nov 18, 2025

Subword Tokenization Strategies for Kurdish Word Embeddings

arXiv:2511.14696v1

Originality: Synthesis-oriented

AI Analysis

This work addresses tokenization challenges for low-resource language processing, specifically Kurdish, offering insights into evaluation biases and method selection, though it is incremental as it applies existing methods to a new domain.

The study compares word-level, morpheme-based, and BPE tokenization for Kurdish word embeddings. Under comprehensive evaluation, morpheme-based tokenization shows better embedding space organization and semantic structure; BPE's apparent advantage rests on only 28.6% of test cases, compared to 68.7% for the morpheme-based model.

We investigate tokenization strategies for Kurdish word embeddings by comparing word-level, morpheme-based, and BPE approaches on morphological similarity preservation tasks. We develop a BiLSTM-CRF morphological segmenter using bootstrapped training from minimal manual annotation and evaluate Word2Vec embeddings across comprehensive metrics including similarity preservation, clustering quality, and semantic organization. Our analysis reveals critical evaluation biases in tokenization comparison. While BPE initially appears superior in morphological similarity, it evaluates only 28.6% of test cases compared to 68.7% for the morpheme model, creating artificial performance inflation. When assessed comprehensively, morpheme-based tokenization demonstrates superior embedding space organization, better semantic neighborhood structure, and more balanced coverage across morphological complexity levels. These findings highlight the importance of coverage-aware evaluation in low-resource language processing and inform the choice of tokenization method for low-resource languages.
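The coverage bias described in the abstract can be illustrated with a minimal sketch. This is not the paper's evaluation code: the function names (`evaluate_with_coverage`, `lookup_scorer`), the toy scores, and the choice to count uncovered pairs as zero in the weighted figure are all assumptions made for illustration. The idea is simply to report a model's mean score together with the fraction of test pairs it could actually score, so that a high mean over a small subset is not mistaken for better overall quality.

```python
from typing import Callable, Dict, List, Optional, Tuple

Pair = Tuple[str, str]
# A scorer returns a similarity score, or None when the model cannot
# represent one of the two words (an uncovered test case).
Scorer = Callable[[str, str], Optional[float]]


def lookup_scorer(table: Dict[Pair, float]) -> Scorer:
    """Hypothetical helper: build a scorer from a table of precomputed scores."""
    def scorer(a: str, b: str) -> Optional[float]:
        return table.get((a, b))
    return scorer


def evaluate_with_coverage(pairs: List[Pair], scorer: Scorer) -> Dict[str, float]:
    """Report the mean score on covered pairs alongside coverage."""
    scores = [s for s in (scorer(a, b) for a, b in pairs) if s is not None]
    coverage = len(scores) / len(pairs) if pairs else 0.0
    mean_covered = sum(scores) / len(scores) if scores else 0.0
    return {
        "coverage": coverage,
        "mean_covered": mean_covered,
        # Assumption: uncovered pairs contribute a score of zero.
        "coverage_weighted": mean_covered * coverage,
    }


# Toy data, not taken from the paper.
test_pairs = [("a", "b"), ("c", "d"), ("e", "f"), ("g", "h")]
narrow_model = lookup_scorer({("a", "b"): 0.9})
broad_model = lookup_scorer(
    {("a", "b"): 0.6, ("c", "d"): 0.8, ("e", "f"): 0.7}
)

narrow_report = evaluate_with_coverage(test_pairs, narrow_model)
broad_report = evaluate_with_coverage(test_pairs, broad_model)
print(narrow_report)
print(broad_report)
```

In this toy setup the narrow model has the higher mean on the pairs it covers, while the broad model scores higher once coverage is taken into account. Other conventions are possible, such as comparing models only on the pairs that all of them cover.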
