CL · Sep 15, 2022

TwHIN-BERT: A Socially-Enriched Pre-trained Language Model for Multilingual Tweet Representations at Twitter

Amazon
arXiv:2209.07562v3 · 120 citations · h-index: 23 · Has Code
Originality: Incremental advance
AI Analysis

This paper addresses the need for language models suited to noisy, user-generated text on social media platforms such as Twitter. The advance is incremental: it builds on existing PLM frameworks and adds social engagement data.

The authors address the fact that pre-trained language models are not tailored to noisy social media text. They introduce TwHIN-BERT, a multilingual model trained on 7 billion tweets with an added social engagement objective, and report significant metric improvements on multilingual social recommendation and semantic understanding tasks.

Pre-trained language models (PLMs) are fundamental for natural language processing applications. Most existing PLMs are not tailored to the noisy user-generated text on social media, and the pre-training does not factor in the valuable social engagement logs available in a social network. We present TwHIN-BERT, a multilingual language model productionized at Twitter, trained on in-domain data from the popular social network. TwHIN-BERT differs from prior pre-trained language models as it is trained with not only text-based self-supervision, but also with a social objective based on the rich social engagements within a Twitter heterogeneous information network (TwHIN). Our model is trained on 7 billion tweets covering over 100 distinct languages, providing a valuable representation to model short, noisy, user-generated text. We evaluate our model on various multilingual social recommendation and semantic understanding tasks and demonstrate significant metric improvement over established pre-trained language models. We open-source TwHIN-BERT and our curated hashtag prediction and social engagement benchmark datasets to the research community.
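
The abstract says the model combines text-based self-supervision with a social objective but does not spell out the form of either loss. As a rough illustration only, the sketch below assumes the social objective is an in-batch contrastive loss over pairs of tweets that share engagement, added to a masked-language-modeling loss with a weight. The function names, the temperature, the weight, and the toy embeddings are all hypothetical and are not taken from the paper or its released code.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def in_batch_contrastive_loss(anchors, positives, temperature=0.1):
    """Mean contrastive loss: anchors[i] should be closest to positives[i].

    Every other positive in the batch acts as a negative for anchor i.
    (Assumed form of the social objective; hypothetical.)
    """
    anchors = [normalize(a) for a in anchors]
    positives = [normalize(p) for p in positives]
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_z - logits[i]
    return total / len(anchors)


def joint_loss(mlm_loss, social_loss, mlm_weight=1.0):
    """Combine the text objective and the social objective (assumed weighting)."""
    return social_loss + mlm_weight * mlm_loss


# Toy embeddings: tweets[i] and co_engaged[i] stand in for a socially similar pair.
tweets = [[1.0, 0.0], [0.0, 1.0]]
co_engaged = [[0.9, 0.1], [0.2, 0.8]]

aligned = in_batch_contrastive_loss(tweets, co_engaged)
misaligned = in_batch_contrastive_loss(tweets, co_engaged[::-1])
print("aligned pairs give lower loss:", aligned < misaligned)
```

In this toy setup, correctly paired embeddings yield a lower contrastive loss than swapped pairs, which is the pressure such an objective would put on a text encoder in addition to the usual masked-token prediction.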

Code Implementations: 1 repo