CL | Apr 29, 2024

Unknown Script: Impact of Script on Cross-Lingual Transfer

arXiv:2404.18810v2 | 32 citations | h-index: 7 | NAACL
Originality: Incremental advance
AI Analysis

This paper addresses a practical problem for NLP researchers and practitioners working with low-resource languages: it identifies the key bottlenecks in cross-lingual transfer.

The paper investigates how the source language of a pre-trained model affects cross-lingual transfer performance, particularly when the target language uses an unknown script, and finds that the tokenizer is a more critical factor than shared script, language similarity, or model size.

Cross-lingual transfer has become an effective way of transferring knowledge between languages. In this paper, we explore an often overlooked aspect in this domain: the influence of the source language of a language model on language transfer performance. We consider a case where the target language and its script are not part of the pre-trained model. We conduct a series of experiments on monolingual and multilingual models that are pre-trained on different tokenization methods to determine factors that affect cross-lingual transfer to a new language with a unique script. Our findings reveal the importance of the tokenizer as a stronger factor than the shared script, language similarity, and model size.

Code Implementations: 1 repo
