CL · Jan 26, 2021

First Align, then Predict: Understanding the Cross-Lingual Ability of Multilingual BERT

arXiv:2101.11109v1 · 810 citations
Originality: Incremental advance
AI Analysis

This paper offers insight into how multilingual models work, which can help researchers improve cross-lingual NLP. The advance is incremental because it builds on the existing understanding of BERT.

The paper investigates the source of cross-lingual transfer in multilingual BERT, revealing through layer ablation and representation analysis that it functions as a multilingual encoder followed by a task-specific predictor, with the encoder being key to transfer and largely unchanged during fine-tuning.

Multilingual pretrained language models have demonstrated remarkable zero-shot cross-lingual transfer capabilities. Such transfer emerges by fine-tuning on a task of interest in one language and evaluating on a distinct language, not seen during the fine-tuning. Despite promising results, we still lack a proper understanding of the source of this transfer. Using a novel layer ablation technique and analyses of the model's internal representations, we show that multilingual BERT, a popular multilingual language model, can be viewed as the stacking of two sub-networks: a multilingual encoder followed by a task-specific language-agnostic predictor. While the encoder is crucial for cross-lingual transfer and remains mostly unchanged during fine-tuning, the task predictor has little importance on the transfer and can be reinitialized during fine-tuning. We present extensive experiments with three distinct tasks, seventeen typologically diverse languages and multiple domains to support our hypothesis.
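The idea of reinitializing part of the layer stack before fine-tuning can be illustrated with a toy sketch. The abstract does not give the details of the ablation technique, so everything here is an assumption for illustration: the layers are plain weight matrices rather than real Transformer layers, the helper names `make_layers` and `reinitialize_layers` are hypothetical, and the 8/4 split between "encoder" and "predictor" layers is chosen arbitrarily.

```python
import random


def make_layers(num_layers=12, width=4, seed=0):
    """Build a toy stack of square weight matrices standing in for model layers."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 0.02) for _ in range(width)] for _ in range(width)]
        for _ in range(num_layers)
    ]


def reinitialize_layers(layers, indices, seed=1, std=0.02):
    """Return a copy of `layers` with the layers at `indices` redrawn at random.

    Layers not listed keep their original ("pretrained") weights, which mimics
    ablating a chosen group of layers before fine-tuning.
    """
    rng = random.Random(seed)
    targets = set(indices)
    new_layers = []
    for i, layer in enumerate(layers):
        if i in targets:
            # Discard the existing weights and draw fresh ones.
            new_layers.append([[rng.gauss(0.0, std) for _ in row] for row in layer])
        else:
            # Copy the existing weights unchanged.
            new_layers.append([row[:] for row in layer])
    return new_layers


pretrained = make_layers()
# Hypothetical split: lower 8 layers as the "encoder", top 4 as the "predictor".
ablated = reinitialize_layers(pretrained, indices=range(8, 12))
unchanged = [i for i in range(len(pretrained)) if ablated[i] == pretrained[i]]
print("layers left untouched:", unchanged)
```

In an experiment of the kind the abstract describes, one would fine-tune the ablated model on the source language and compare the drop in zero-shot performance on the target language depending on which group of layers was reinitialized.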

Code Implementations: 1 repo