CL | Apr 19, 2021

BERTić -- The Transformer Language Model for Bosnian, Croatian, Montenegrin and Serbian

arXiv:2104.09243v1 | 2.671 citations

Originality: Synthesis-oriented

AI Analysis

The paper addresses the shortage of NLP resources for Bosnian, Croatian, Montenegrin and Serbian. Its contribution is incremental, since it adapts existing methods to new data.

The authors address the lack of transformer language models for Bosnian, Croatian, Montenegrin and Serbian by pre-training BERTić on 8 billion tokens of crawled web text. The model improves over previous state-of-the-art models on part-of-speech tagging, named-entity recognition, geo-location prediction and commonsense causal reasoning.

In this paper we describe a transformer model pre-trained on 8 billion tokens of crawled text from the Croatian, Bosnian, Serbian and Montenegrin web domains. We evaluate the transformer model on the tasks of part-of-speech tagging, named-entity-recognition, geo-location prediction and commonsense causal reasoning, showing improvements on all tasks over state-of-the-art models. For commonsense reasoning evaluation, we introduce COPA-HR -- a translation of the Choice of Plausible Alternatives (COPA) dataset into Croatian. The BERTić model is made available for free usage and further task-specific fine-tuning through HuggingFace.
