CL · Jul 22, 2021

Evaluation of contextual embeddings on less-resourced languages

Matej Ulčar, Aleš Žagar, Carlos S. Armendariz, Andraž Repar, Senja Pollak, Matthew Purver, Marko Robnik-Šikonja

arXiv:2107.10614v11.212 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses a gap in NLP research for non-English languages by providing empirical benchmarks for practitioners working with less-resourced languages, though it is incremental, since it compares existing methods on new data.

The paper evaluates contextual embeddings such as ELMo and BERT on less-resourced languages. In monolingual settings, monolingual BERT models generally perform best, with a few exceptions such as dependency parsing, where ELMo models trained on large corpora excel. In cross-lingual settings, BERT models trained on only a few languages mostly do best, closely followed by massively multilingual BERT models.

The current dominance of deep neural networks in natural language processing is based on contextual embeddings such as ELMo, BERT, and BERT derivatives. Most existing work focuses on English; in contrast, we present here the first multilingual empirical comparison of two ELMo and several monolingual and multilingual BERT models using 14 tasks in nine languages. In monolingual settings, our analysis shows that monolingual BERT models generally dominate, with a few exceptions such as the dependency parsing task, where they are not competitive with ELMo models trained on large corpora. In cross-lingual settings, BERT models trained on only a few languages mostly do best, closely followed by massively multilingual BERT models.
