CL AI Aug 3, 2025

The Art of Breaking Words: Rethinking Multilingual Tokenizer Design

Aamod Thakur, Ajay Nagpal, Atharva Savarkar, Kundeshwar Pundalik, Siddhesh Dosi, Piyush Sawarkar, Viraj Thakur, Rohit Saluja, Maunendra Sankar Desarkar, Ganesh Ramakrishnan

arXiv:2508.06533v16.72 citations, h-index: 16

Originality: Incremental advance

AI Analysis

This work addresses tokenization inefficiencies for multilingual LLM developers, offering incremental improvements in efficiency and speed.

The paper tackles inefficient tokenization in multilingual large language models, particularly for Indic scripts. It proposes a data composition algorithm for tokenizer training and studies pretokenization strategies. The data composition algorithm reduces the average token-to-word ratio by approximately 6% relative to conventional data randomization, and the resulting tokenizer improves the average token-to-word ratio by more than 40% over state-of-the-art multilingual Indic models, yielding gains in model performance and inference speed.

While model architecture and training objectives are well-studied, tokenization, particularly in multilingual contexts, remains a relatively neglected aspect of Large Language Model (LLM) development. Existing tokenizers often exhibit high token-to-word ratios, inefficient use of context length, and slower inference. We present a systematic study that links vocabulary size, pre-tokenization rules, and training-corpus composition to both token-to-word efficiency and model quality. To ground our analysis in a linguistically diverse context, we conduct extensive experiments on Indic scripts, which present unique challenges due to their high script diversity and orthographic complexity. Drawing on the insights from these analyses, we propose a novel algorithm for data composition that balances multilingual data for tokenizer training. Our observations on pre-tokenization strategies significantly improve model performance, and our data composition algorithm reduces the average token-to-word ratio by approximately 6% with respect to the conventional data randomization approach. Our tokenizer achieves more than 40% improvement on average token-to-word ratio against state-of-the-art multilingual Indic models. This improvement yields measurable gains in both model performance and inference speed. This highlights tokenization alongside architecture and training objectives as a critical lever for building efficient, scalable multilingual LLMs.
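The abstract's headline metric, the token-to-word ratio, can be illustrated with a minimal sketch. The abstract does not give the paper's exact definition, so this assumes the ratio is total tokens divided by total whitespace-delimited words over a corpus. The function names and the two toy tokenizers are hypothetical and are not the authors' tokenizer.

```python
from typing import Callable, Iterable, List


def token_to_word_ratio(
    texts: Iterable[str], tokenize: Callable[[str], List[str]]
) -> float:
    """Total tokens divided by total whitespace-delimited words.

    Assumption: words are split on whitespace; the paper may define
    words or averaging (e.g. per language) differently.
    """
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("corpus contains no words")
    return total_tokens / total_words


def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


# Toy tokenizers for illustration only (hypothetical).
def char_tokenize(text: str) -> List[str]:
    # Worst case: one token per non-space character.
    return [ch for ch in text if not ch.isspace()]


def word_tokenize(text: str) -> List[str]:
    # Best case under this metric: one token per word.
    return text.split()


corpus = ["ab cd", "efg"]
char_ratio = token_to_word_ratio(corpus, char_tokenize)
word_ratio = token_to_word_ratio(corpus, word_tokenize)
print(char_ratio, word_ratio, relative_reduction(char_ratio, word_ratio))
```

Under this definition a lower ratio means fewer tokens per word, so the same context window holds more text and generation needs fewer decoding steps. A reported reduction of about 6% would correspond to `relative_reduction(baseline, improved)` being roughly 0.06.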

