CL, Feb 10

Effective vocabulary expanding of multilingual language models for extremely low-resource languages

arXiv:2602.09388v1
Originality: Incremental advance
AI Analysis

This work addresses the challenge of supporting extremely low-resource languages in NLP, though it is incremental as it builds on existing continued pre-training methods.

The paper extends multilingual pre-trained language models to previously unsupported low-resource languages: it expands the vocabulary using target-language data and initializes the new entries with bilingual dictionaries. Over a baseline that initializes the expanded vocabulary randomly, it reports improvements of 0.54% in POS tagging and 2.60% in NER.

Multilingual pre-trained language models (mPLMs) offer significant benefits for many low-resource languages. To further expand the range of languages these models can support, many works focus on continued pre-training of these models. However, few works address how to extend mPLMs to low-resource languages that were previously unsupported. To tackle this issue, we expand the model's vocabulary using a target language corpus. We then screen out a subset from the model's original vocabulary, which is biased towards representing the source language (e.g., English), and utilize bilingual dictionaries to initialize the representations of the expanded vocabulary. Subsequently, we continue to pre-train the mPLMs using the target language corpus, based on the representations of the expanded vocabulary. Experimental results show that our proposed method outperforms the baseline, which uses randomly initialized expanded vocabulary for continued pre-training, in POS tagging and NER tasks, achieving improvements of 0.54% and 2.60%, respectively. Furthermore, our method demonstrates high robustness in selecting the training corpora, and the models' performance on the source language does not degrade after continued pre-training.
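The dictionary-based initialization described in the abstract can be illustrated with a small sketch. The abstract does not say how dictionary entries are turned into vectors, so averaging the embeddings of a new token's source-language translations is an assumption here, not the authors' confirmed procedure. The function name `init_expanded_embeddings`, the toy tokens, and the `scale` parameter are all hypothetical.

```python
import random
from statistics import fmean


def init_expanded_embeddings(embeddings, new_tokens, bilingual_dict, seed=0, scale=0.02):
    """Build initial vectors for tokens added to the vocabulary.

    embeddings:     dict mapping an existing token to its vector (list of floats)
    new_tokens:     target-language tokens added during vocabulary expansion
    bilingual_dict: dict mapping a target-language token to source-language translations
    """
    rng = random.Random(seed)
    dim = len(next(iter(embeddings.values())))
    result = {}
    for token in new_tokens:
        # Keep only the translations that the original vocabulary already covers.
        known = [embeddings[t] for t in bilingual_dict.get(token, []) if t in embeddings]
        if known:
            # Assumption: average the translations' vectors, one dimension at a time.
            result[token] = [fmean(column) for column in zip(*known)]
        else:
            # No usable dictionary entry: fall back to random initialization,
            # which is what the baseline in the abstract uses for every new token.
            result[token] = [rng.gauss(0.0, scale) for _ in range(dim)]
    return result


# Toy data: three source-language tokens and three hypothetical new tokens.
source_embeddings = {"water": [1.0, 0.0], "river": [0.0, 1.0], "dog": [0.5, 0.5]}
toy_dictionary = {"tok_a": ["water", "river"], "tok_b": ["dog", "hound"]}
new_vectors = init_expanded_embeddings(
    source_embeddings, ["tok_a", "tok_b", "tok_c"], toy_dictionary
)
```

In a real model the resulting vectors would be copied into the rows appended to the embedding matrix before continued pre-training begins; that step depends on the framework and is left out of this sketch.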

