CL | Jun 11, 2023

RoBERTweet: A BERT Language Model for Romanian Tweets

arXiv:2306.06598v1 | 5 citations | h-index: 13
Originality: Synthesis-oriented
AI Analysis

This work supports social media analysis in the Romanian NLP community by providing a specialized language model and dataset. The contribution is incremental, since it adapts an existing architecture to a new language domain.

The authors tackled the lack of a Transformer model for Romanian tweets by introducing RoBERTweet, the first such model, which outperformed existing general-domain and multilingual models on emotion detection, sexist language identification, and named entity recognition tasks.

Developing natural language processing (NLP) systems for social media analysis remains an important topic in artificial intelligence research. This article introduces RoBERTweet, the first Transformer architecture trained on Romanian tweets. Our RoBERTweet comes in two versions, following the base and large architectures of BERT. The corpus used for pre-training the models represents a novelty for the Romanian NLP community and consists of all tweets collected from 2008 to 2022. Experiments show that RoBERTweet models outperform the previous general-domain Romanian and multilingual language models on three NLP tasks with tweet inputs: emotion detection, sexist language identification, and named entity recognition. We make our models and the newly created corpus of Romanian tweets freely available.

