CL | Dec 20, 2023

Turning English-centric LLMs Into Polyglots: How Much Multilinguality Is Needed?

arXiv:2312.12683v2 | 58 citations | h-index: 7 | Has Code | EMNLP
Originality: Incremental advance
AI Analysis

This paper addresses the problem of making English-centric LLMs multilingual for downstream applications. Its contribution is incremental: it identifies how little multilingual data fine-tuning requires.

The study investigates the minimal multilingual data needed during fine-tuning to enable English-centric large language models (LLMs) to generalize across languages, finding that as few as two to three languages are sufficient, with effectiveness depending on pretraining exposure and task type.

The vast majority of today's large language models (LLMs) are English-centric, having been pretrained predominantly on English text. Yet, in order to meet user expectations, models need to be able to respond appropriately in multiple languages once deployed in downstream applications. This requires strong cross-lingual transfer abilities. In this work, we investigate the minimal amount of multilinguality required during finetuning to elicit cross-lingual generalisation in English-centric LLMs. In experiments across four LLMs, we find that multilingual instruction tuning with as few as two to three languages is both necessary and sufficient to elicit effective cross-lingual generalisation, with the limiting factor being the degree to which a target language is seen during pretraining. Evaluations on five different tasks further reveal that multilingual instruction tuning is most beneficial for generative tasks that assume input/output language agreement, such as in chat settings, while being of less importance for highly structured classification-style tasks. Our code and data are available at https://github.com/ZurichNLP/multilingual-instruction-tuning.
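To make the idea of "multilingual instruction tuning with as few as two to three languages" concrete, here is a minimal sketch of how a fixed-size fine-tuning set might be assembled from a small number of languages. It is an illustration under stated assumptions, not the authors' pipeline: the function name `build_instruction_mix`, the `pools` structure, the even split across languages, and the toy data are all hypothetical.

```python
import random
from typing import Dict, List


def build_instruction_mix(
    pools: Dict[str, List[dict]],
    languages: List[str],
    n_total: int,
    seed: int = 0,
) -> List[dict]:
    """Sample a fixed-size instruction-tuning set split evenly across languages.

    Hypothetical helper: the even split is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    per_lang, remainder = divmod(n_total, len(languages))
    mix: List[dict] = []
    for i, lang in enumerate(languages):
        # Hand out any leftover examples to the first few languages.
        k = per_lang + (1 if i < remainder else 0)
        if k > len(pools[lang]):
            raise ValueError(f"not enough examples for language {lang!r}")
        mix.extend(rng.sample(pools[lang], k))
    rng.shuffle(mix)
    return mix


# Toy pools standing in for real instruction data (assumed structure).
pools = {
    lang: [{"lang": lang, "instruction": f"{lang}-{i}"} for i in range(100)]
    for lang in ["en", "de", "zh"]
}

# Same training budget, different amounts of multilinguality.
monolingual = build_instruction_mix(pools, ["en"], n_total=90)
trilingual = build_instruction_mix(pools, ["en", "de", "zh"], n_total=90)
```

Holding `n_total` constant while varying `languages` keeps the training budget fixed, so any difference in downstream behaviour could be attributed to the number of languages rather than to the amount of data.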

Code Implementations: 1 repo
