Annotating Norwegian Language Varieties on Twitter for Part-of-Speech
This work addresses NLP for Norwegian language varieties on social media. Its contribution is incremental: it applies existing POS tagging methods to new data.
The authors tackled the challenge of processing Norwegian Twitter data, which combines typical social media variation with dialectal diversity, by creating a novel POS-tagged dataset. They found that models trained on Universal Dependencies (UD) data performed worse when evaluated on this dataset, and that Bokmål-trained models generally outperformed Nynorsk-trained ones. For some models, performance on dialectal tweets was comparable to performance on tweets in the written standards.
Norwegian Twitter data poses an interesting challenge for Natural Language Processing (NLP) tasks. These texts are difficult for models trained on standardized text in one of the two Norwegian written forms (Bokmål and Nynorsk), as they contain both the typical variation of social media text and a large amount of dialectal variety. In this paper we present a novel Norwegian Twitter dataset annotated with POS tags. We show that models trained on Universal Dependencies (UD) data perform worse when evaluated against this dataset, and that models trained on Bokmål generally perform better than those trained on Nynorsk. We also see that performance on dialectal tweets is comparable to the written standards for some models. Finally, we perform a detailed analysis of the errors that models commonly make on this data.
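The comparison across written standards and dialectal tweets amounts to scoring tagger output separately for each language variety. The following is a minimal sketch of such a breakdown, not the evaluation code used in the paper. The function name `accuracy_by_variety`, the variety labels, and the toy tag sequences are all assumptions made for illustration; none of them are taken from the dataset.

```python
from collections import defaultdict


def accuracy_by_variety(examples):
    """Compute token-level POS accuracy per language variety.

    `examples` is an iterable of (variety, gold_tags, predicted_tags)
    triples, where both tag lists cover the same tokens in order.
    This input layout is a hypothetical one chosen for the sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for variety, gold, pred in examples:
        if len(gold) != len(pred):
            raise ValueError("gold and predicted tag sequences must align")
        for gold_tag, pred_tag in zip(gold, pred):
            total[variety] += 1
            correct[variety] += int(gold_tag == pred_tag)
    return {variety: correct[variety] / total[variety] for variety in total}


# Invented toy data, only to show the call shape (not real annotations).
toy_examples = [
    ("bokmål", ["PRON", "VERB", "ADV"], ["PRON", "VERB", "ADV"]),
    ("nynorsk", ["PRON", "VERB", "ADV"], ["PRON", "VERB", "ADJ"]),
    ("dialect", ["PRON", "AUX", "ADV", "ADJ"], ["PRON", "VERB", "ADV", "ADJ"]),
]

scores = accuracy_by_variety(toy_examples)
for variety, score in sorted(scores.items()):
    print(f"{variety}: {score:.2f}")
```

Reporting one score per variety, rather than a single pooled accuracy, is what makes it possible to see whether dialectal tweets are harder for a given model than tweets written in Bokmål or Nynorsk.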