CL LG · Nov 15, 2022

RobBERT-2022: Updating a Dutch Language Model to Account for Evolving Language Use

Pieter Delobelle, Thomas Winters, Bettina Berendt

arXiv:2211.08192v11.110 citations · h-index: 34

Originality: Synthesis-oriented

AI Analysis

This work addresses the problem of outdated language models for Dutch NLP users, though it is incremental as it builds on an existing model.

The authors updated the Dutch language model RobBERT to incorporate recent language changes, such as corona-related words, and found that this update significantly improved performance on certain language tasks.

Large transformer-based language models, e.g. BERT and GPT-3, outperform previous architectures on most natural language processing tasks. Such language models are first pre-trained on gigantic corpora of text and later used as base models for finetuning on a particular task. Since the pre-training step is usually not repeated, base models are not up-to-date with the latest information. In this paper, we update RobBERT, a RoBERTa-based state-of-the-art Dutch language model, which was trained in 2019. First, the tokenizer of RobBERT is updated to include new high-frequency tokens present in the latest Dutch OSCAR corpus, e.g. corona-related words. Then we further pre-train the RobBERT model using this dataset. To evaluate whether our new model is a plug-in replacement for RobBERT, we introduce two additional criteria based on concept drift of existing tokens and alignment for novel tokens. We found that for certain language tasks this update results in a significant performance increase. These results highlight the benefit of continually updating a language model to account for evolving language use.
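The two ideas in the abstract, extending a vocabulary with frequent new words and checking how far existing tokens have drifted, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the word-level counting stands in for a real subword (BPE) tokenizer, the use of cosine distance as a drift measure is an assumption (the abstract does not specify the criteria), and the function names, thresholds, toy corpus, and toy vectors are all hypothetical.

```python
from collections import Counter
import math
import re


def select_new_tokens(corpus, old_vocab, min_count=2, max_new=5):
    """Return frequent words in `corpus` that are absent from `old_vocab`.

    Assumption: word-level counting is a simplification; a real tokenizer
    update would operate on subword units.
    """
    counts = Counter()
    for line in corpus:
        counts.update(re.findall(r"\w+", line.lower()))
    candidates = [
        (word, n) for word, n in counts.items()
        if n >= min_count and word not in old_vocab
    ]
    # Most frequent first; ties are broken alphabetically for determinism.
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [word for word, _ in candidates[:max_new]]


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy data (hypothetical): an "old" vocabulary and a small "recent" corpus.
old_vocab = {"de", "het", "een", "virus", "is", "nieuw", "test"}
corpus = [
    "De coronatest is nieuw",
    "Een coronatest en een lockdown",
    "De lockdown is nieuw",
    "Het virus",
]
new_tokens = select_new_tokens(corpus, old_vocab)

# Assumed drift measure: cosine distance between a token's embedding
# before and after further pre-training (toy 2-d vectors).
old_vec = [1.0, 0.0]
new_vec = [0.8, 0.6]
drift = 1.0 - cosine_similarity(old_vec, new_vec)

print(new_tokens)
print(round(drift, 3))
```

In this sketch, a low drift value for existing tokens would suggest the updated model remains close to the original for vocabulary it already knew, which is the intuition behind treating it as a plug-in replacement.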
