CL · Apr 29, 2024

Unknown Script: Impact of Script on Cross-Lingual Transfer

Wondimagegnhue Tsegaye Tufa, Ilia Markov, Piek Vossen

arXiv:2404.18810v2 · 15.432 citations · h-index: 7 · Has Code · NAACL

Originality: Incremental advance

AI Analysis

The paper addresses a practical problem for NLP researchers and practitioners working with low-resource languages: identifying the key bottlenecks in cross-lingual transfer.

The paper investigates how the source language of a pre-trained model affects cross-lingual transfer performance, particularly when the target language uses an unknown script, and finds that the tokenizer is a more critical factor than shared script, language similarity, or model size.

Cross-lingual transfer has become an effective way of transferring knowledge between languages. In this paper, we explore an often overlooked aspect in this domain: the influence of the source language of a language model on language transfer performance. We consider a case where the target language and its script are not part of the pre-trained model. We conduct a series of experiments on monolingual and multilingual models that are pre-trained with different tokenization methods to determine the factors that affect cross-lingual transfer to a new language with a unique script. Our findings reveal that the tokenizer is a stronger factor than shared script, language similarity, and model size.
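One way to see why the tokenizer can matter so much for an unseen script is a toy comparison, sketched below under stated assumptions. It is not the paper's setup: the two tokenizers (`vocab_tokenize`, `byte_tokenize`), the Latin-only vocabulary, and the Ge'ez-script sample string are all hypothetical choices made for illustration. A fixed-vocabulary tokenizer collapses every unseen character to an unknown token, while a byte-level tokenizer can still represent the text losslessly.

```python
def vocab_tokenize(text, vocab, unk="[UNK]"):
    """Toy character-level tokenizer with a fixed vocabulary.

    Characters missing from the vocabulary collapse to the unk token;
    whitespace is dropped.
    """
    return [ch if ch in vocab else unk for ch in text if not ch.isspace()]


def byte_tokenize(text):
    """Toy byte-level tokenizer: every string maps to UTF-8 byte values (0-255)."""
    return list(text.encode("utf-8"))


def unk_rate(tokens, unk="[UNK]"):
    """Fraction of tokens that are the unknown token."""
    return sum(t == unk for t in tokens) / len(tokens) if tokens else 0.0


# Hypothetical vocabulary covering only lowercase Latin letters.
latin_vocab = set("abcdefghijklmnopqrstuvwxyz")

known_text = "hello world"
unseen_text = "ሰላም ዓለም"  # illustrative sample in a script outside the vocabulary

print(unk_rate(vocab_tokenize(known_text, latin_vocab)))   # 0.0
print(unk_rate(vocab_tokenize(unseen_text, latin_vocab)))  # 1.0

# The byte-level view never produces an unknown token and round-trips exactly.
byte_ids = byte_tokenize(unseen_text)
print(bytes(byte_ids).decode("utf-8") == unseen_text)      # True
```

With the fixed vocabulary, all information about the unseen script is lost before the model sees it, whereas the byte-level sequence keeps it at the cost of longer inputs. Real subword tokenizers sit between these two extremes.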

