CL, Jun 21, 2025

Multilingual Tokenization through the Lens of Indian Languages: Challenges and Insights

arXiv:2506.17789v2, 1 citation, h-index: 16
Originality: Incremental advance
AI Analysis

The paper addresses tokenization fairness and efficiency for linguistically diverse languages, particularly those of the Indian subcontinent, but the advance is incremental because it builds on existing methods.

This paper tackles the problem of tokenization being skewed towards high-resource languages by evaluating strategies across 17 Indian languages, showing that low-resource languages benefit from tokenizers trained on related high-resource languages and quantifying trade-offs between algorithms and vocabulary sizes.

Tokenization plays a pivotal role in multilingual NLP. However, existing tokenizers are often skewed towards high-resource languages, limiting their effectiveness for linguistically diverse and morphologically rich languages such as those in the Indian subcontinent. This paper presents a comprehensive intrinsic evaluation of tokenization strategies across 17 Indian languages. We quantify the trade-offs between bottom-up and top-down tokenizer algorithms (BPE and Unigram LM), effects of vocabulary sizes, and compare strategies of multilingual vocabulary construction such as joint and cluster-based training. We also show that extremely low-resource languages can benefit from tokenizers trained on related high-resource languages. Our study provides practical insights for building more fair, efficient, and linguistically informed tokenizers for multilingual NLP.

