CL · Sep 26, 2024

LangSAMP: Language-Script Aware Multilingual Pretraining

arXiv:2409.18199v2 · 2 citations · h-index: 13 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the challenge of achieving language neutrality in multilingual models for NLP applications, representing an incremental improvement over existing methods.

The paper tackles the problem of language-specific information burden in multilingual pretrained language models by proposing LangSAMP, which incorporates language and script embeddings, resulting in improved zero-shot crosslingual transfer across diverse tasks and enhanced language-neutral representations.

Recent multilingual pretrained language models (mPLMs) often avoid using language embeddings -- learnable vectors assigned to individual languages. However, this places a significant burden on token representations to encode all language-specific information, which may hinder language neutrality. To address this limitation, we propose Language-Script Aware Multilingual Pretraining (LangSAMP), a method that incorporates both language and script embeddings to enhance representation learning. Specifically, we integrate these embeddings into the output of the Transformer blocks before passing the final representations to the language modeling head for prediction. We apply LangSAMP to the continual pretraining of XLM-R on a highly multilingual corpus covering more than 500 languages. The resulting model consistently outperforms the baseline in zero-shot crosslingual transfer across diverse downstream tasks. Extensive analysis reveals that language and script embeddings capture language- and script-specific nuances, which makes the token representations more language-neutral, as evidenced by improved pairwise cosine similarity. In a case study, we also show that language and script embeddings can be used to select better source languages for crosslingual transfer. We make our code and models publicly available at https://github.com/cisnlp/LangSAMP.
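The mechanism described in the abstract can be illustrated with a toy sketch. It is not taken from the LangSAMP repository: the function names, the three-dimensional vectors, and the language and script codes are all hypothetical. It also rests on two assumptions the abstract does not confirm: that "integrating" the embeddings means adding them element-wise to each token's final hidden state, and that source languages are ranked by cosine similarity between language embeddings.

```python
import math


def add_vectors(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(components) for components in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def langsamp_head_input(hidden_states, lang_emb, script_emb):
    """Assumed integration step: add the language and script embeddings to
    every token's final Transformer output before the LM head sees it."""
    return [add_vectors(h, lang_emb, script_emb) for h in hidden_states]


def rank_source_languages(target, lang_embeddings):
    """Assumed selection rule: rank candidate source languages by cosine
    similarity between their embedding and the target language's embedding."""
    target_vec = lang_embeddings[target]
    scores = {
        lang: cosine(target_vec, vec)
        for lang, vec in lang_embeddings.items()
        if lang != target
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy, hand-written values (hypothetical; real embeddings are learned).
hidden = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]]  # two tokens, hidden size 3
lang_embeddings = {
    "eng": [1.0, 0.0, 0.0],
    "deu": [0.9, 0.1, 0.0],
    "rus": [0.0, 1.0, 0.2],
}
script_embeddings = {"Latn": [0.0, 0.0, 0.5], "Cyrl": [0.0, 0.5, 0.0]}

# Representations that would be passed to the language modeling head
# for a German sentence written in Latin script.
head_input = langsamp_head_input(
    hidden, lang_embeddings["deu"], script_embeddings["Latn"]
)

# Candidate source languages for German, most similar first.
ranking = rank_source_languages("deu", lang_embeddings)
print(head_input)
print(ranking)
```

Because the embeddings are added only at the output of the Transformer blocks, the sketch leaves the token-level hidden states themselves free of the added language and script signal, which is the intuition the abstract gives for more language-neutral representations.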

Code Implementations: 1 repo