Identifying Necessary Elements for BERT's Multilinguality
This work helps researchers in multilingual NLP understand why mBERT is multilingual, a question whose answer has remained obscure. The contribution is incremental: it builds on prior studies and isolates the specific elements involved.
The study asks why multilingual BERT (mBERT) works well without crosslingual training. It looks for the architectural and linguistic elements that are necessary for multilinguality and identifies four architectural and two linguistic factors. Experiments on XNLI with three languages indicate that these insights transfer to larger settings.
It has been shown that multilingual BERT (mBERT) yields high-quality multilingual representations and enables effective zero-shot transfer. This is surprising given that mBERT does not use any crosslingual signal during training. While recent literature has studied this phenomenon, the reasons for the multilinguality are still somewhat obscure. We aim to identify architectural properties of BERT and linguistic properties of languages that are necessary for BERT to become multilingual. To allow for fast experimentation, we propose an efficient setup with small BERT models trained on a mix of synthetic and natural data. Overall, we identify four architectural and two linguistic elements that influence multilinguality. Based on our insights, we experiment with a multilingual pretraining setup that modifies the masking strategy using VecMap, i.e., unsupervised embedding alignment. Experiments on XNLI with three languages indicate that our findings transfer from our small setup to larger scale settings.
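
The abstract does not spell out how the masking strategy is modified. As a rough illustration only, the sketch below implements the standard BERT masking rule (select about 15% of tokens; of those, replace 80% with [MASK], replace 10% with a random token, and leave 10% unchanged) together with one hypothetical variant in which the random replacement is drawn from crosslingual nearest neighbors, such as might be read off VecMap-aligned embeddings. The function name `mask_tokens`, the `neighbors` dictionary, and the toy vocabulary are assumptions made for this example; they are not the authors' implementation.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, rng, neighbors=None, mask_prob=0.15):
    """Corrupt a token sequence for masked language modeling.

    Returns (corrupted, labels). labels[i] holds the original token at
    positions selected for prediction and None everywhere else.
    `neighbors` is a hypothetical dict mapping a token to a list of
    crosslingual nearest neighbors (assumed here, not taken from the paper).
    """
    corrupted, labels = [], []
    for tok in tokens:
        # Leave most positions untouched.
        if rng.random() >= mask_prob:
            corrupted.append(tok)
            labels.append(None)
            continue

        labels.append(tok)
        roll = rng.random()
        if roll < 0.8:
            # Standard BERT: 80% of selected tokens become [MASK].
            corrupted.append(MASK)
        elif roll < 0.9:
            # Standard BERT: 10% become a random vocabulary token.
            # Assumed variant: prefer a crosslingual neighbor when one exists.
            candidates = neighbors.get(tok) if neighbors else None
            corrupted.append(rng.choice(candidates or vocab))
        else:
            # Standard BERT: 10% stay unchanged but are still predicted.
            corrupted.append(tok)
    return corrupted, labels


if __name__ == "__main__":
    rng = random.Random(0)
    sentence = ["the", "cat", "sat", "on", "the", "mat"]
    toy_vocab = ["the", "cat", "sat", "on", "mat", "dog"]
    # Toy English-to-German neighbor lists, invented for illustration.
    toy_neighbors = {"cat": ["Katze"], "mat": ["Matte"]}
    print(mask_tokens(sentence, toy_vocab, rng, neighbors=toy_neighbors))
```

Passing `neighbors=None` recovers the plain BERT rule, so the two strategies can be compared under the same random seed.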