CL · LG · Jan 25, 2023

XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models

UW
arXiv:2301.10472v2 · 173 citations · h-index: 116
Originality: Incremental advance
AI Analysis

This work addresses the vocabulary limitation in multilingual models, improving performance for low-resource languages, though it is an incremental advancement over existing methods.

The paper tackles the vocabulary bottleneck in multilingual language models by introducing XLM-V, a model with a one-million-token vocabulary that de-emphasizes token sharing between languages, leading to shorter and more meaningful tokenizations. XLM-V outperforms XLM-R on natural language inference, question answering, and named entity recognition, with absolute gains of up to 11.2% on low-resource language benchmarks.

Large multilingual language models typically rely on a single vocabulary shared across 100+ languages. As these models have increased in parameter count and depth, vocabulary size has remained largely unchanged. This vocabulary bottleneck limits the representational capabilities of multilingual models like XLM-R. In this paper, we introduce a new approach for scaling to very large multilingual vocabularies by de-emphasizing token sharing between languages with little lexical overlap and assigning vocabulary capacity to achieve sufficient coverage for each individual language. Tokenizations using our vocabulary are typically more semantically meaningful and shorter compared to XLM-R. Leveraging this improved vocabulary, we train XLM-V, a multilingual language model with a one million token vocabulary. XLM-V outperforms XLM-R on every task we tested on ranging from natural language inference (XNLI), question answering (MLQA, XQuAD, TyDiQA), to named entity recognition (WikiAnn). XLM-V is particularly effective on low-resource language tasks and outperforms XLM-R by 11.2% and 5.8% absolute on MasakhaNER and Americas NLI, respectively.
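The abstract's claim that a larger vocabulary gives shorter tokenizations can be illustrated with a toy sketch. It is not the paper's tokenizer: it assumes a simple greedy longest-match splitter, and the two vocabularies and the word list are hypothetical, chosen only to show the effect of added coverage on sequence length.

```python
def greedy_tokenize(word, vocab):
    """Split `word` left to right into the longest pieces found in `vocab`.

    Characters not covered by any vocabulary entry become single-character
    pieces, so the function always terminates.
    """
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first; accept a single character as a fallback.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces


def avg_tokens_per_word(words, vocab):
    """Average number of pieces per word (lower means shorter tokenizations)."""
    return sum(len(greedy_tokenize(w, vocab)) for w in words) / len(words)


# Hypothetical toy vocabularies: the large one adds whole-word entries.
small_vocab = {"un", "break", "able", "token", "s"}
large_vocab = small_vocab | {"unbreakable", "tokens", "tokenization"}

words = ["unbreakable", "tokens", "tokenization"]

if __name__ == "__main__":
    for name, vocab in [("small", small_vocab), ("large", large_vocab)]:
        print(name, [greedy_tokenize(w, vocab) for w in words],
              avg_tokens_per_word(words, vocab))
```

With the small vocabulary, "tokenization" falls back to single characters after "token", while the large vocabulary covers each word with one piece. The same measure (average pieces per word) could be applied to real tokenizers to check the abstract's claim on actual text.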

Code Implementations: 3 repos
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
