CL · May 23

The Tokenizer Tax Across 25 European Languages: Domain Invariance, Cross-Lingual Few-Shot Effects, and the Ukrainian Penalty

arXiv:2605.24718 · 24.8

Predicted impact: top 6% in CL · last 90 days · Originality: Incremental advance

AI Analysis

For multilingual NLP practitioners, this provides the first controlled tokenizer tax map across European languages and highlights the penalties paid by underrepresented languages.

The study measures tokenizer fertility across 25 European languages for ten foundation models, finding a 2.5x tax from English (1.2 tokens/word) to Greek/Maltese (~3.1), with Ukrainian paying a 15-18% penalty over cognate Slavic languages. Fertility rankings are domain-invariant across three text registers (rho > 0.97), and few-shot effects are model-intrinsic rather than language-dependent.

Tokenizer fertility, the number of tokens per word, imposes a hidden cost on non-English NLP. We measure fertility for ten foundation models across 25 European languages on parallel text, producing the first controlled tokenizer tax map for the continent. The tax spans 2.5x from English (1.2 tokens/word) to Greek/Maltese (~3.1), following a clear hierarchy: Romance (1.5-1.7), Germanic (1.7-1.9), Slavic (2.2-2.5), Uralic/Baltic (2.7-3.0). Ukrainian (2.7) pays 15-18% more than cognate Slavic languages, reflecting underrepresentation in pre-training data. Fertility rankings are domain-invariant across three text registers (rho > 0.97). A subword analysis reveals that high-fertility tokenizers fragment morphological boundaries rather than preserving them. Cross-lingual few-shot evaluation on four Slavic languages shows that few-shot effects are model-intrinsic, not language-dependent. We release all measurements as a public dataset.
