CL May 24, 2021

DaN+: Danish Nested Named Entities and Lexical Normalization

Barbara Plank, Kristian Nørgaard Jensen, Rob van der Goot

arXiv:2105.11301v135.1993 citations Has Code

Originality: Incremental advance

AI Analysis

This work addresses nested NER and lexical normalization for Danish, a less-resourced language, but is incremental as it builds on existing methods like BERT and multi-task learning.

The paper addresses cross-lingual, cross-domain learning for Danish nested named entity recognition (NER) by introducing a new corpus and comparing modeling strategies. Multi-task learning was the most robust strategy, in-language BERT and lexical normalization helped most on the least canonical data, and out-of-domain setups remained challenging.

This paper introduces DaN+, a new multi-domain corpus and annotation guidelines for Danish nested named entities (NEs) and lexical normalization to support research on cross-lingual cross-domain learning for a less-resourced language. We empirically assess three strategies to model the two-layer Named Entity Recognition (NER) task. We compare transfer capabilities from German versus in-language annotation from scratch. We examine language-specific versus multilingual BERT, and study the effect of lexical normalization on NER. Our results show that 1) the most robust strategy is multi-task learning which is rivaled by multi-label decoding, 2) BERT-based NER models are sensitive to domain shifts, and 3) in-language BERT and lexical normalization are the most beneficial on the least canonical data. Our results also show that an out-of-domain setup remains challenging, while performance on news plateaus quickly. This highlights the importance of cross-domain evaluation of cross-lingual transfer.
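To make the two strategies named in the abstract more concrete, the sketch below shows one way the training targets for a two-layer nested annotation could be encoded: as two parallel tag sequences (multi-task learning, one output per layer) or as one multi-hot vector per token (multi-label decoding). The example sentence, the BIO tags, the layer names `outer` and `inner`, and both helper functions are illustrative assumptions and are not taken from the DaN+ corpus or the authors' code.

```python
from typing import Dict, List, Tuple

# Hypothetical example: "Aarhus" (LOC) is nested inside "Aarhus Universitet" (ORG).
tokens = ["Aarhus", "Universitet", "ligger", "i", "Danmark"]
outer = ["B-ORG", "I-ORG", "O", "O", "B-LOC"]  # assumed first annotation layer
inner = ["B-LOC", "O", "O", "O", "O"]          # assumed second (nested) layer


def multitask_targets(outer: List[str], inner: List[str]) -> Dict[str, List[str]]:
    """Multi-task view: each layer is its own sequence-labeling task."""
    if len(outer) != len(inner):
        raise ValueError("both layers must cover the same tokens")
    return {"outer": list(outer), "inner": list(inner)}


def multilabel_targets(
    outer: List[str], inner: List[str]
) -> Tuple[List[str], List[List[int]]]:
    """Multi-label view: one multi-hot vector per token over all non-O tags."""
    if len(outer) != len(inner):
        raise ValueError("both layers must cover the same tokens")
    label_set = sorted({tag for tag in outer + inner if tag != "O"})
    index = {label: i for i, label in enumerate(label_set)}
    vectors = []
    for outer_tag, inner_tag in zip(outer, inner):
        vec = [0] * len(label_set)
        for tag in (outer_tag, inner_tag):
            if tag != "O":
                vec[index[tag]] = 1  # a token may switch on several labels
        vectors.append(vec)
    return label_set, vectors


tasks = multitask_targets(outer, inner)
labels, vectors = multilabel_targets(outer, inner)
for token, vec in zip(tokens, vectors):
    print(token, vec)
```

In the multi-task view a model would predict each layer with a separate output, while in the multi-label view a single output may activate several labels for the same token, as happens for "Aarhus" here.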
