CL, Oct 2, 2020

Cost-effective Selection of Pretraining Data: A Case Study of Pretraining BERT on Social Media

arXiv:2010.01150v1, 1000 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for efficient data selection in pretraining for social media applications, but it is incremental as it builds on existing domain-specific BERT approaches.

The study tackles the problem of cost-effectively selecting pretraining data for domain-specific BERT models. It pretrains two models on social media text (tweets and forum text, respectively), demonstrates the effectiveness of both resources empirically, and publicly releases the pretrained models.

Recent studies on domain-specific BERT models show that effectiveness on downstream tasks can be improved when models are pretrained on in-domain data. Often, the pretraining data used in these models are selected based on their subject matter, e.g., biology or computer science. Given the range of applications using social media text, and its unique language variety, we pretrain two models on tweets and forum text respectively, and empirically demonstrate the effectiveness of these two resources. In addition, we investigate how similarity measures can be used to nominate in-domain pretraining data. We publicly release our pretrained models at https://bit.ly/35RpTf0.
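
To make the idea of nominating pretraining data by similarity concrete, here is a minimal sketch. The abstract does not name the similarity measures used, so this example assumes one simple, illustrative choice: Jaccard overlap between the most frequent tokens of the task data and of each candidate pretraining source. The function names (`top_k_vocab`, `jaccard`, `rank_sources`), the whitespace tokenizer, and the toy texts are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def top_k_vocab(texts, k=1000):
    """Return the set of the k most frequent lowercased whitespace tokens."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    return {tok for tok, _ in counts.most_common(k)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_sources(task_texts, candidate_sources, k=1000):
    """Rank candidate pretraining sources by vocabulary overlap with task data.

    Assumption: vocabulary overlap is only a stand-in for whichever
    similarity measures the paper actually investigates.
    """
    task_vocab = top_k_vocab(task_texts, k)
    scores = {
        name: jaccard(task_vocab, top_k_vocab(texts, k))
        for name, texts in candidate_sources.items()
    }
    # Highest similarity first: the top entry is the nominated source.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy, made-up data purely for illustration.
task = ["omg this phone is gr8 lol", "lol cant wait for the weekend"]
candidates = {
    "tweets": ["lol this is gr8", "omg cant wait"],
    "encyclopedia": ["the phone was invented in the nineteenth century"],
}
ranking = rank_sources(task, candidates)
print(ranking)
```

On this toy input the tweet-like source ranks above the encyclopedic one, because it shares more of the task data's frequent vocabulary. A measure like this is cheap to compute before committing to a full pretraining run, which is the kind of cost saving the title alludes to, though real corpora would need a proper tokenizer and a much larger `k`.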

