Extending Multilingual BERT to Low-Resource Languages
This work addresses the limited language coverage of multilingual models used by NLP practitioners. It offers a simple extension method that improves performance on both existing and new languages, though the contribution is incremental.
The paper addresses the restriction of Multilingual BERT (M-BERT) to the top 104 Wikipedia languages it was trained on by proposing E-BERT, a method for extending it to low-resource languages. On Named Entity Recognition, E-BERT yields an average gain of about 6% F1 on languages already in M-BERT and 23% F1 on new languages.
Multilingual BERT (M-BERT) has been a huge success in both supervised and zero-shot cross-lingual transfer learning. However, this success has been limited to the top 104 languages in Wikipedia that it was trained on. In this paper, we propose a simple but effective approach to extend M-BERT (E-BERT) so that it can benefit any new language, and show that our approach benefits languages that are already in M-BERT as well. We perform an extensive set of experiments with Named Entity Recognition (NER) on 27 languages, only 16 of which are in M-BERT, and show an average increase of about 6% F1 on languages that are already in M-BERT and an average increase of 23% F1 on new languages.
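
The language split in the abstract implies some simple arithmetic, sketched below. This is only an illustration and not a figure reported by the authors: it assumes each group average is an unweighted mean over the languages in that group and that the gains are expressed in F1 points. The function name `combined_gain` is hypothetical.

```python
def combined_gain(n_total, n_existing, gain_existing, gain_new):
    """Return (number of new languages, language-weighted average gain).

    Assumption: each group gain is an unweighted mean over its languages,
    so the overall mean is the group means weighted by group size.
    """
    n_new = n_total - n_existing
    if n_new < 0:
        raise ValueError("n_existing cannot exceed n_total")
    weighted = (n_existing * gain_existing + n_new * gain_new) / n_total
    return n_new, weighted


# Figures taken from the abstract: 27 languages, 16 already in M-BERT,
# about 6 F1 gain on existing languages and 23 F1 gain on new ones.
n_new, overall = combined_gain(27, 16, 6.0, 23.0)
print(n_new)              # 11 languages that are not in M-BERT
print(round(overall, 2))  # 12.93, an implied (not reported) overall average
```

Under these assumptions, 11 of the 27 languages are new to M-BERT, and the implied average gain across all 27 would be roughly 13 F1 points. Because the abstract gives the 6% figure only approximately, the combined value should be read as a rough estimate.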