CL LG · Jan 25, 2023

XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models

Davis Liang, Hila Gonen, Yuning Mao, Rui Hou, Naman Goyal, Marjan Ghazvininejad, Luke Zettlemoyer, Madian Khabsa

arXiv:2301.10472v2 · 25.5173 citations · h-index: 116

Originality: Incremental advance

AI Analysis

This work addresses the vocabulary limitation in multilingual models, improving performance for low-resource languages, though it is an incremental advancement over existing methods.

The paper tackles the vocabulary bottleneck in multilingual masked language models by introducing XLM-V, a model with a one-million-token vocabulary that de-emphasizes token sharing between languages with little lexical overlap, yielding shorter and more semantically meaningful tokenizations. XLM-V outperforms XLM-R on natural language inference, question answering, and named entity recognition, with absolute gains of 11.2% on MasakhaNER and 5.8% on Americas NLI, both low-resource language benchmarks.

Large multilingual language models typically rely on a single vocabulary shared across 100+ languages. As these models have increased in parameter count and depth, vocabulary size has remained largely unchanged. This vocabulary bottleneck limits the representational capabilities of multilingual models like XLM-R. In this paper, we introduce a new approach for scaling to very large multilingual vocabularies by de-emphasizing token sharing between languages with little lexical overlap and assigning vocabulary capacity to achieve sufficient coverage for each individual language. Tokenizations using our vocabulary are typically more semantically meaningful and shorter than those of XLM-R. Leveraging this improved vocabulary, we train XLM-V, a multilingual language model with a one-million-token vocabulary. XLM-V outperforms XLM-R on every task we tested, ranging from natural language inference (XNLI) and question answering (MLQA, XQuAD, TyDiQA) to named entity recognition (WikiAnn). XLM-V is particularly effective on low-resource language tasks and outperforms XLM-R by 11.2% and 5.8% absolute on MasakhaNER and Americas NLI, respectively.
