CL | Jul 3, 2020

Playing with Words at the National Library of Sweden -- Making a Swedish BERT

arXiv:2007.01658v1 | 140 citations | Has Code
Originality: Synthesis-oriented
AI Analysis

The paper addresses the lack of high-quality NLP resources for smaller languages such as Swedish, though its contribution is incremental: it applies an existing method to a new language-specific dataset.

The paper describes the creation of a Swedish-specific BERT model, KB-BERT, which outperforms existing models such as Arbetsförmedlingen's and Google's M-BERT on tasks such as named entity recognition and part-of-speech tagging.

This paper introduces the Swedish BERT ("KB-BERT") developed by the KBLab for data-driven research at the National Library of Sweden (KB). Building on recent efforts to create transformer-based BERT models for languages other than English, we explain how we used KB's collections to create and train a new language-specific BERT model for Swedish. We also present the results of our model in comparison with existing models - chiefly that produced by the Swedish Public Employment Service, Arbetsförmedlingen, and Google's multilingual M-BERT - where we demonstrate that KB-BERT outperforms these in a range of NLP tasks from named entity recognition (NER) to part-of-speech tagging (POS). Our discussion highlights the difficulties that continue to exist given the lack of training data and testbeds for smaller languages like Swedish. We release our model for further exploration and research here: https://github.com/Kungbib/swedish-bert-models .

Code Implementations: 1 repo