CL, Feb 5

Modelling the Morphology of Verbal Paradigms: A Case Study in the Tokenization of Turkish and Hebrew

arXiv:2602.05648v11.62 citations, h-index: 6

Originality: Incremental advance

AI Analysis

This work addresses how to model morphological complexity in low-resource languages, a problem relevant to NLP researchers. It is incremental because it builds on existing tokenization and paradigm analysis methods.

The study examined how transformer models represent complex verb paradigms in Turkish and Hebrew and found that tokenization strategy significantly affects performance. For Hebrew, whose morphology is non-concatenative, monolingual models using morpheme-aware segmentation performed well. For Turkish, whose morphology is transparent, both monolingual and multilingual models succeeded.

We investigate how transformer models represent complex verb paradigms in Turkish and Modern Hebrew, concentrating on how tokenization strategies shape this ability. Using the Blackbird Language Matrices task on natural data, we show that for Turkish, with its transparent morphological markers, both monolingual and multilingual models succeed, whether tokenization is atomic or breaks words into small subword units. For Hebrew, by contrast, monolingual and multilingual models diverge. A multilingual model using character-level tokenization fails to capture the language's non-concatenative morphology, but a monolingual model with morpheme-aware segmentation performs well. Performance improves for all models on more synthetic datasets.
