CL, Dec 15, 2019

A Comparison of Architectures and Pretraining Methods for Contextualized Multilingual Word Embeddings

Niels van der Heijden, Samira Abnar, Ekaterina Shutova

arXiv:1912.10169v1, 1.716 citations

Originality: Incremental advance

AI Analysis

This work addresses data scarcity for low-resource languages in NLP. It is incremental in that it builds on existing methods and reports comparative improvements over them.

The paper tackles data scarcity in multilingual NLP by comparing state-of-the-art encoders and proposing a new method for creating multilingual contextualized word embeddings. The method performs at or above the state of the art in zero-shot transfer and improves knowledge sharing across languages in joint training.

The lack of annotated data in many languages is a well-known challenge within the field of multilingual natural language processing (NLP). Therefore, many recent studies focus on zero-shot transfer learning and joint training across languages to overcome data scarcity for low-resource languages. In this work we (i) perform a comprehensive comparison of state-of-the-art multilingual word and sentence encoders on the tasks of named entity recognition (NER) and part-of-speech (POS) tagging; and (ii) propose a new method for creating multilingual contextualized word embeddings, compare it to multiple baselines, and show that it performs at or above state-of-the-art level in zero-shot transfer settings. Finally, we show that our method allows for better knowledge sharing across languages in a joint training setting.
