CL IRApr 12

HeceTokenizer: A Syllable-Based Tokenization Approach for Turkish Retrieval

arXiv:2604.1066580.5h-index: 6

Predicted impact top 66% in CL · last 90 daysOriginality Incremental advance

AI Analysis

Provides a resource-light, OOV-free tokenization method for Turkish retrieval tasks, outperforming larger models.

HeceTokenizer, a syllable-based tokenizer for Turkish, achieves 50.3% Recall@5 on TQuAD retrieval, surpassing a morphology-driven baseline's 46.92% with a 200x smaller model.

HeceTokenizer is a syllable-based tokenizer for Turkish that exploits the deterministic six-pattern phonological structure of the language to construct a closed, out-of-vocabulary (OOV)-free vocabulary of approximately 8,000 unique syllable types. A BERT-tiny encoder (1.5M parameters) is trained from scratch on a subset of Turkish Wikipedia using a masked language modeling objective and evaluated on the TQuAD retrieval benchmark using Recall@5. Combined with a fine-grained chunk-based retrieval strategy, HeceTokenizer achieves 50.3% Recall@5, surpassing the 46.92% reported by a morphology-driven baseline that uses a 200 times larger model. These results suggest that the phonological regularity of Turkish syllables provides a strong and resource-light inductive bias for retrieval tasks.

View on arXiv PDF

Similar