XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models
This work addresses the vocabulary limitation in multilingual models, improving performance for low-resource languages, though it is an incremental advancement over existing methods.
The paper tackles the vocabulary bottleneck in multilingual language models by introducing XLM-V, a model with a one-million-token vocabulary that de-emphasizes token sharing between languages with little lexical overlap, yielding shorter and more semantically meaningful tokenizations than XLM-R. XLM-V outperforms XLM-R on natural language inference, question answering, and named entity recognition, with absolute gains of up to 11.2% on low-resource language benchmarks (MasakhaNER).
Large multilingual language models typically rely on a single vocabulary shared across 100+ languages. As these models have increased in parameter count and depth, vocabulary size has remained largely unchanged. This "vocabulary bottleneck" limits the representational capabilities of multilingual models like XLM-R. In this paper, we introduce a new approach for scaling to very large multilingual vocabularies by de-emphasizing token sharing between languages with little lexical overlap and assigning vocabulary capacity to achieve sufficient coverage for each individual language. Tokenizations using our vocabulary are typically more semantically meaningful and shorter than those of XLM-R. Leveraging this improved vocabulary, we train XLM-V, a multilingual language model with a one-million-token vocabulary. XLM-V outperforms XLM-R on every task we tested, ranging from natural language inference (XNLI) and question answering (MLQA, XQuAD, TyDiQA) to named entity recognition (WikiAnn). XLM-V is particularly effective on low-resource language tasks and outperforms XLM-R by 11.2% and 5.8% absolute on MasakhaNER and Americas NLI, respectively.
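The abstract does not say how lexical overlap is measured or how languages are grouped, so the following is only a toy sketch of the general idea, not the paper's procedure. It assumes Jaccard similarity over per-language token sets as the overlap measure and a hypothetical `threshold` for deciding which languages share vocabulary; the function names and the toy data are illustrative.

```python
from __future__ import annotations

from itertools import combinations


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two token sets (0 = disjoint, 1 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_by_overlap(vocabs: dict[str, set[str]], threshold: float) -> list[set[str]]:
    """Group languages whose pairwise overlap reaches the threshold.

    Uses a small union-find, so grouping is transitive (single-link).
    Assumption: both the overlap measure and the threshold are
    illustrative choices, not taken from the paper.
    """
    parent = {lang: lang for lang in vocabs}

    def find(x: str) -> str:
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(vocabs, 2):
        if jaccard(vocabs[a], vocabs[b]) >= threshold:
            parent[find(a)] = find(b)

    groups: dict[str, set[str]] = {}
    for lang in vocabs:
        groups.setdefault(find(lang), set()).add(lang)
    return list(groups.values())


# Toy token sets (hypothetical data, far smaller than any real vocabulary).
toy_vocabs = {
    "es": {"la", "casa", "de", "animal", "hospital"},
    "pt": {"a", "casa", "de", "animal", "hospital"},
    "ru": {"дом", "и", "в"},
}

groups = group_by_overlap(toy_vocabs, threshold=0.3)
print(groups)
```

In this toy setting, Spanish and Portuguese share most of their tokens and end up in one group, while Russian has no overlap with either and gets its own. Each group would then receive its own share of the overall vocabulary budget rather than competing for a single shared one.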