CL · May 26, 2023

Tokenization Impacts Multilingual Language Modeling: Assessing Vocabulary Allocation and Overlap Across Languages

arXiv:2305.17179v1 · 240 citations · Has Code
Originality: Synthesis-oriented
AI Analysis

This work provides guidelines for model developers to choose tokenizers for specific applications, addressing a domain-specific problem in multilingual NLP.

The study assessed how vocabulary allocation and overlap in sub-word tokenizers affect multilingual language models, finding that vocabulary overlap harms tasks like POS and dependency parsing but benefits NER and sentence-level tasks, with language-specific token coverage significantly impacting word-level tasks.

Multilingual language models have recently gained attention as a promising solution for representing multiple languages in a single model. In this paper, we propose new criteria to evaluate the quality of lexical representation and vocabulary overlap observed in sub-word tokenizers. Our findings show that the overlap of vocabulary across languages can actually be detrimental to certain downstream tasks (POS, dependency tree labeling). In contrast, NER and sentence-level tasks (cross-lingual retrieval, NLI) benefit from sharing vocabulary. We also observe that the coverage of the language-specific tokens in the multilingual vocabulary significantly impacts the word-level tasks. Our study offers a deeper understanding of the role of tokenizers in multilingual language models and guidelines for future model developers to choose the most suitable tokenizer for their specific application before undertaking costly model pre-training.
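The abstract does not spell out the proposed criteria, so the following is only an illustrative sketch of how vocabulary allocation and overlap could be quantified, not necessarily the paper's own implementation. It assumes two stand-in measures: characters per token as a proxy for how well a vocabulary covers a language, and the Jensen-Shannon divergence between the token distributions of two languages as a proxy for overlap. The greedy `tokenize` function and all function names are hypothetical; a real study would use a trained sub-word tokenizer.

```python
import math
from collections import Counter


def tokenize(word, vocab):
    """Toy greedy longest-match segmentation (a stand-in for a real
    sub-word tokenizer). Characters not covered by `vocab` become
    single-character tokens."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                tokens.append(word[i:j])
                i = j
                break
    return tokens


def token_counts(corpus, vocab):
    """Count token occurrences over a list of whitespace-separated sentences."""
    counts = Counter()
    for sentence in corpus:
        for word in sentence.split():
            counts.update(tokenize(word, vocab))
    return counts


def chars_per_token(counts):
    """Average token length; higher values suggest the vocabulary
    allocates longer, more word-like units to this language."""
    total_tokens = sum(counts.values())
    total_chars = sum(len(tok) * n for tok, n in counts.items())
    return total_chars / total_tokens


def jensen_shannon_divergence(counts_a, counts_b):
    """Base-2 JSD between two token distributions:
    0.0 means identical usage, 1.0 means no shared tokens."""
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    jsd = 0.0
    for tok in set(counts_a) | set(counts_b):
        p = counts_a[tok] / total_a
        q = counts_b[tok] / total_b
        m = (p + q) / 2
        if p > 0:
            jsd += 0.5 * p * math.log2(p / m)
        if q > 0:
            jsd += 0.5 * q * math.log2(q / m)
    return jsd
```

Under these assumptions, a low divergence between two languages indicates heavy vocabulary sharing, while a low characters-per-token value for one language indicates that its words are split into many short pieces. Both numbers can be computed from tokenized text alone, before any model is pre-trained.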

Code Implementations: 1 repo