CL · Dec 31, 2020

How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models

arXiv:2012.15613v2 · 774 citations
AI Analysis

This work helps researchers and practitioners understand the performance trade-offs of multilingual models for specific languages, particularly concerning tokenizer design.

This paper systematically compares multilingual language models with their monolingual counterparts across nine typologically diverse languages and five downstream tasks. The authors find that, while pretraining data size matters, a dedicated monolingual tokenizer plays an equally important role in downstream performance: replacing the multilingual tokenizer with a specialized monolingual one improves the multilingual model's performance for almost every task and language.

In this work, we provide a systematic and comprehensive empirical comparison of pretrained multilingual language models versus their monolingual counterparts with regard to their monolingual task performance. We study a set of nine typologically diverse languages with readily available pretrained monolingual models on a set of five diverse monolingual downstream tasks. We first aim to establish, via fair and controlled comparisons, if a gap between the multilingual and the corresponding monolingual representation of that language exists, and subsequently investigate the reason for any performance difference. To disentangle conflating factors, we train new monolingual models on the same data, with monolingually and multilingually trained tokenizers. We find that while the pretraining data size is an important factor, a designated monolingual tokenizer plays an equally important role in the downstream performance. Our results show that languages that are adequately represented in the multilingual model's vocabulary exhibit negligible performance decreases over their monolingual counterparts. We further find that replacing the original multilingual tokenizer with the specialized monolingual tokenizer improves the downstream performance of the multilingual model for almost every task and language.
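To make the idea of a language being "adequately represented" in a vocabulary more concrete, the following is a minimal sketch that counts the average number of subwords a tokenizer produces per word. It is an illustration only, not the paper's evaluation code: the greedy longest-match splitter, the function names `wordpiece_split` and `subwords_per_word`, and both toy vocabularies are assumptions made up for this example.

```python
from typing import List, Set


def wordpiece_split(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Greedy longest-match-first split of a single word into subwords.

    Non-initial pieces carry a "##" prefix (WordPiece-style convention).
    If some part of the word cannot be matched, the whole word becomes `unk`.
    """
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]
        pieces.append(match)
        start = end
    return pieces


def subwords_per_word(text: str, vocab: Set[str]) -> float:
    """Average number of subword pieces per whitespace-separated word."""
    words = text.split()
    total = sum(len(wordpiece_split(w, vocab)) for w in words)
    return total / len(words)


# Hypothetical toy vocabularies: the "monolingual" one stores whole words,
# the "multilingual" one only has room for shorter fragments.
monolingual_vocab = {"talossa", "on", "kirjasto"}
multilingual_vocab = {"tal", "##os", "##sa", "on", "kir", "##jas", "##to"}

text = "talossa on kirjasto"
print(wordpiece_split("talossa", multilingual_vocab))
print(subwords_per_word(text, monolingual_vocab))
print(subwords_per_word(text, multilingual_vocab))
```

In this toy setting, the made-up multilingual vocabulary splits the same sentence into more pieces per word than the monolingual one. A lower average is only a rough proxy for how well a vocabulary covers a language; it does not by itself predict downstream task performance.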

Code Implementations: 1 repo