CL | Oct 17, 2020

TweetBERT: A Pretrained Language Representation Model for Twitter Text Analysis

arXiv:2010.11091v12.444 citations | Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for better text analysis tools among researchers and practitioners working with social media data, though it is incremental in that it adapts an existing method to a new domain.

The authors address the poor performance of existing language representation models on Twitter text by introducing TweetBERT, a set of domain-specific models pre-trained on millions of tweets, which they report outperform traditional BERT models by more than 7% on each Twitter dataset.

Twitter is a well-known microblogging social site where users express their views and opinions in real time. As a result, tweets tend to contain valuable information. With the advancement of deep learning in natural language processing, extracting meaningful information from tweets has become a growing interest among natural language researchers. Applying existing language representation models to extract information from Twitter often does not produce good results. Moreover, there are no existing language representation models for text analysis specific to the social media domain. Hence, in this article, we introduce two TweetBERT models, which are domain-specific language representation models pre-trained on millions of tweets. We show that the TweetBERT models significantly outperform the traditional BERT models in Twitter text mining tasks, by more than 7% on each Twitter dataset. We also provide an extensive analysis by evaluating seven BERT models on 31 different datasets. Our results validate our hypothesis that continuing to train language models on a Twitter corpus helps performance on Twitter text.
