CL · Jun 4, 2019

Sequence Tagging with Contextual and Non-Contextual Subword Representations: A Multilingual Evaluation

arXiv:1906.01569v131.21104 citations · Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses a practical problem for NLP practitioners by providing systematic guidance on embedding selection for multilingual sequence tagging tasks. The contribution is incremental, since it builds on existing embedding methods.

The paper addresses the problem of choosing between pretrained contextual and non-contextual subword embeddings for multilingual NLP. It conducts an extensive evaluation on named entity recognition and part-of-speech tagging across languages and finds that a combination of BERT, BPEmb, and character representations works best overall. BERT excels in medium- to high-resource languages but is outperformed by non-contextual embeddings in low-resource settings.

Pretrained contextual and non-contextual subword embeddings have become available in over 250 languages, allowing massively multilingual NLP. However, while there is no dearth of pretrained embeddings, the distinct lack of systematic evaluations makes it difficult for practitioners to choose between them. In this work, we conduct an extensive evaluation comparing non-contextual subword embeddings, namely FastText and BPEmb, and a contextual representation method, namely BERT, on multilingual named entity recognition and part-of-speech tagging. We find that overall, a combination of BERT, BPEmb, and character representations works best across languages and tasks. A more detailed analysis reveals different strengths and weaknesses: Multilingual BERT performs well in medium- to high-resource languages, but is outperformed by non-contextual subword embeddings in a low-resource setting.
