CL, Nov 12, 2025

Contextual morphologically-guided tokenization for Latin encoder models

arXiv:2511.09709v1
h-index: 2
Originality: Incremental advance
AI Analysis

The paper addresses tokenization problems in morphologically rich languages and presents linguistic resources as a feasible way to improve language modeling performance, particularly for low-resource languages. The advance is incremental, since it builds on existing work on integrating linguistic resources.

The paper tackles suboptimal tokenization for morphologically rich languages such as Latin by investigating morphologically-aware tokenization. It finds that this approach improves overall performance on four downstream tasks, with the largest gains on out-of-domain texts.

Tokenization is a critical component of language model pretraining, yet standard tokenization methods often prioritize information-theoretical goals like high compression and low fertility rather than linguistic goals like morphological alignment. In fact, they have been shown to be suboptimal for morphologically rich languages, where tokenization quality directly impacts downstream performance. In this work, we investigate morphologically-aware tokenization for Latin, a morphologically rich language that is medium-resource in terms of pretraining data, but high-resource in terms of curated lexical resources -- a distinction that is often overlooked but critical in discussions of low-resource language modeling. We find that morphologically-guided tokenization improves overall performance on four downstream tasks. Performance gains are most pronounced for out-of-domain texts, highlighting our models' improved generalization ability. Our findings demonstrate the utility of linguistic resources to improve language modeling for morphologically complex languages. For low-resource languages that lack large-scale pretraining data, the development and incorporation of linguistic resources can serve as a feasible alternative to improve LM performance.
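
The contrast the abstract draws between fertility and morphological alignment can be made concrete with a small sketch. This is not the paper's method or evaluation code. The function names (`fertility`, `boundary_alignment`), the two toy segmentations, and the gold morpheme split of the Latin form "amabantur" are all assumptions made for illustration. Fertility is taken here as the average number of subword tokens per word, and alignment as the share of gold morpheme boundaries that a segmentation reproduces.

```python
from typing import Callable, List, Set


def fertility(words: List[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of subword tokens produced per word."""
    if not words:
        raise ValueError("words must be non-empty")
    return sum(len(tokenize(w)) for w in words) / len(words)


def _boundaries(segments: List[str]) -> Set[int]:
    """Character offsets at which a segmentation splits a word."""
    offsets, pos = set(), 0
    for seg in segments[:-1]:
        pos += len(seg)
        offsets.add(pos)
    return offsets


def boundary_alignment(gold: List[str], predicted: List[str]) -> float:
    """Fraction of gold morpheme boundaries present in the predicted split."""
    gold_b = _boundaries(gold)
    if not gold_b:
        return 1.0  # nothing to align for a monomorphemic word
    return len(gold_b & _boundaries(predicted)) / len(gold_b)


# Hypothetical data: an illustrative gold split and two made-up tokenizer outputs.
word = "amabantur"
gold_split = ["ama", "ba", "nt", "ur"]
compression_split = ["amab", "antur"]        # fewer tokens, ignores morphemes
morph_guided_split = ["ama", "ba", "ntur"]   # more tokens, follows morphemes

fert_compression = fertility([word], lambda w: compression_split)
fert_morph = fertility([word], lambda w: morph_guided_split)
align_compression = boundary_alignment(gold_split, compression_split)
align_morph = boundary_alignment(gold_split, morph_guided_split)

print(f"compression-style: fertility={fert_compression}, alignment={align_compression:.2f}")
print(f"morph-guided:      fertility={fert_morph}, alignment={align_morph:.2f}")
```

In this toy case the compression-style split has lower fertility but reproduces none of the gold boundaries, while the morphologically guided split uses one more token and recovers two of the three. It only shows why the two objectives can pull in different directions; it says nothing about the magnitude of the effects reported in the paper.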
