CL · Nov 29, 2022

Extending the Subwording Model of Multilingual Pretrained Models for New Languages

arXiv:2211.15965v1 · 4 citations · h-index: 44
Originality: Synthesis-oriented
AI Analysis

This paper addresses the challenge of adapting fixed-vocabulary models to low-resource languages such as Inuktitut, but the contribution is incremental: it builds on existing tokenizer modification techniques.

The paper extends multilingual pretrained models to new languages by adding new subwords to the tokenizer without altering the segmentation of existing languages, and applies the approach to English-Inuktitut translation with mBART-50.

Multilingual pretrained models are effective for machine translation and cross-lingual processing because they contain multiple languages in one model. However, they are pretrained after their tokenizers are fixed; therefore it is difficult to change the vocabulary after pretraining. When we extend the pretrained models to new languages, we must modify the tokenizers simultaneously. In this paper, we add new subwords to the SentencePiece tokenizer to apply a multilingual pretrained model to new languages (Inuktitut in this paper). In our experiments, we segmented Inuktitut sentences into subwords without changing the segmentation of already pretrained languages, and applied the mBART-50 pretrained model to English-Inuktitut translation.
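The idea in the abstract, appending new subwords to an existing vocabulary while leaving the segmentation of already-covered languages untouched, can be illustrated with a toy unigram tokenizer. This is a minimal sketch under stated assumptions, not the paper's implementation: it does not use the SentencePiece library, and the function names (`viterbi_segment`, `extend_vocab`), the scores, and the two Inuktitut pieces are all hypothetical and chosen only for illustration. The English segmentation stays fixed here because the added pieces use characters that never occur in the English text; with overlapping scripts this would have to be verified rather than assumed.

```python
import math


def viterbi_segment(text, scores, unk_score=-100.0):
    """Best-scoring segmentation of `text` under a toy unigram model.

    `scores` maps each subword piece to a log-probability-like score.
    Characters not covered by any piece fall back to single-character
    pieces that cost `unk_score` each.
    """
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0
    max_len = max(len(piece) for piece in scores)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in scores:
                cand = best[start] + scores[piece]
            elif end - start == 1:
                cand = best[start] + unk_score  # unknown-character fallback
            else:
                continue
            if cand > best[end]:
                best[end] = cand
                back[end] = start
    # Walk the back-pointers from the end of the text to recover the pieces.
    pieces, end = [], n
    while end > 0:
        pieces.append(text[back[end]:end])
        end = back[end]
    return pieces[::-1]


def extend_vocab(scores, new_pieces, new_score):
    """Return a copy of `scores` with unseen `new_pieces` appended at the end.

    Existing pieces keep their scores and their positions, so ids derived
    from the insertion order stay valid for the pretrained languages.
    """
    extended = dict(scores)
    for piece in new_pieces:
        extended.setdefault(piece, new_score)
    return extended


# Hypothetical "pretrained" vocabulary covering a little English.
base_scores = {"the": -3.0, " ": -1.0, "model": -5.0, "mod": -4.0, "el": -4.0}

english = "the model"
inuktitut = "ᐃᓄᒃᑎᑐᑦ"  # "Inuktitut" written in syllabics

# Illustrative new pieces only; not linguistically motivated subwords.
new_pieces = [inuktitut[:3], inuktitut[3:]]
extended_scores = extend_vocab(base_scores, new_pieces, new_score=-8.0)

print(viterbi_segment(english, base_scores))
print(viterbi_segment(english, extended_scores))
print(viterbi_segment(inuktitut, base_scores))
print(viterbi_segment(inuktitut, extended_scores))
```

Before the extension, the Inuktitut string falls apart into unknown single characters; afterwards it is covered by the appended pieces, while the English sentence is segmented the same way under both vocabularies. In a real setup the pretrained model's embedding table would also need new rows for the appended pieces, which this sketch leaves out.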

Code Implementations: 1 repo
