CL Jan 19

Reducing Tokenization Premiums for Low-Resource Languages

arXiv:2601.13328v11.1

Originality: Incremental advance

AI Analysis

The paper addresses the higher API and energy costs faced by users of low-resource languages, though the advance is incremental because it builds on existing pre-trained models.

The paper tackles tokenization premiums in low-resource languages: sentences require more tokens than their English equivalents, which raises costs and shrinks the effective context window. The authors propose reducing these premiums by adding tokens to a pre-trained model's vocabulary post hoc, coalescing multi-token characters into single tokens, and show that compressed inputs often yield last hidden states similar to those of the original inputs in the Llama 3.2 1B model.

Relative to English, low-resource languages suffer from substantial tokenization premiums in modern LMs, meaning that it generally requires several times as many tokens to encode a sentence in a low-resource language as to encode the analogous sentence in English. This tokenization premium results in increased API and energy costs and reduced effective context windows for these languages. In this paper we analyze the tokenizers of ten popular LMs to better understand their designs and per-language tokenization premiums. We also propose a mechanism to reduce tokenization premiums in pre-trained models, by post-hoc additions to the token vocabulary that coalesce multi-token characters into single tokens. We apply this methodology to 12 low-resource languages, demonstrating that the original and compressed inputs often have similar last hidden states when run through the Llama 3.2 1B model.
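
To make the idea of a tokenization premium concrete, the following is a minimal sketch and not the paper's implementation. It assumes a toy baseline tokenizer that emits one token per UTF-8 byte (a simplification of how byte-level tokenizers handle unseen scripts) and a hypothetical helper, `build_char_vocab`, that adds one new token id per multi-byte character. The function names, the id offset of 256, and the example sentence pair are all illustrative assumptions. The sketch does not cover how embeddings for the new tokens would be initialized or how hidden states would be compared.

```python
from typing import Dict, List


def byte_tokenize(text: str) -> List[int]:
    """Toy baseline (assumption): one token per UTF-8 byte, ids 0..255."""
    return list(text.encode("utf-8"))


def build_char_vocab(corpus: str, first_new_id: int = 256) -> Dict[str, int]:
    """Assign one new token id to every character that costs more than
    one token under the toy byte-level baseline."""
    vocab: Dict[str, int] = {}
    for ch in corpus:
        if len(ch.encode("utf-8")) > 1 and ch not in vocab:
            vocab[ch] = first_new_id + len(vocab)
    return vocab


def compressed_tokenize(text: str, char_vocab: Dict[str, int]) -> List[int]:
    """Emit a single added token for known multi-byte characters and
    fall back to byte tokens for everything else."""
    tokens: List[int] = []
    for ch in text:
        if ch in char_vocab:
            tokens.append(char_vocab[ch])
        else:
            tokens.extend(ch.encode("utf-8"))
    return tokens


def premium(n_target: int, n_english: int) -> float:
    """Tokenization premium: target-language tokens per English token."""
    return n_target / n_english


# Illustrative pair (assumption): an English greeting and an Amharic one.
english = "hello"
amharic = "ሰላም"  # three Ethiopic characters, three UTF-8 bytes each

char_vocab = build_char_vocab(amharic)
n_english = len(byte_tokenize(english))
n_before = len(byte_tokenize(amharic))
n_after = len(compressed_tokenize(amharic, char_vocab))

premium_before = premium(n_before, n_english)
premium_after = premium(n_after, n_english)
print(f"premium before: {premium_before}, after: {premium_after}")
```

In this toy setting the added tokens cut the Amharic token count by a factor of three. Real tokenizers merge bytes in more complex ways, so actual premiums and savings would differ by model and language.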
