CL
Apr 25, 2021

XLM-T: Multilingual Language Models in Twitter for Sentiment Analysis and Beyond

arXiv:2104.12250v2
605 citations
Originality: Synthesis-oriented
AI Analysis

The paper addresses the scarcity of multilingual NLP tools for noisy social media text such as Twitter and offers a practical resource for researchers and practitioners, although the contribution is incremental because it builds on the existing XLM-R model.

The paper tackles the lack of multilingual language models for Twitter by introducing XLM-T, a model pre-trained on millions of tweets in over thirty languages, and achieves strong baseline performance for sentiment analysis across eight languages.

Language models are ubiquitous in current NLP, and their multilingual capacity has recently attracted considerable attention. However, current analyses have almost exclusively focused on (multilingual variants of) standard benchmarks, and have relied on clean pre-training and task-specific corpora as multilingual signals. In this paper, we introduce XLM-T, a model to train and evaluate multilingual language models in Twitter. We provide: (1) a new strong multilingual baseline consisting of an XLM-R (Conneau et al. 2020) model pre-trained on millions of tweets in over thirty languages, alongside starter code to subsequently fine-tune on a target task; and (2) a set of unified sentiment analysis Twitter datasets in eight different languages and an XLM-T model fine-tuned on them.
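To make the idea of "unified" sentiment datasets concrete, the following is a minimal sketch of how corpora with different label vocabularies might be mapped onto one shared scheme before fine-tuning. It is not the authors' code: the three-way target scheme, the dataset names (`dataset_en`, `dataset_es`), the label maps, and the toy records are all assumptions made for illustration.

```python
from collections import Counter

# Shared target scheme (an assumption for this sketch, not taken from the paper).
UNIFIED_LABELS = ("negative", "neutral", "positive")

# Hypothetical per-dataset label vocabularies mapped onto the shared scheme.
LABEL_MAPS = {
    "dataset_en": {"0": "negative", "1": "neutral", "2": "positive"},
    "dataset_es": {"N": "negative", "NEU": "neutral", "P": "positive"},
}


def unify(records):
    """Convert (dataset, language, text, raw_label) tuples to a common format.

    Records whose raw label has no mapping are dropped rather than guessed.
    """
    unified = []
    for dataset, language, text, raw_label in records:
        label = LABEL_MAPS.get(dataset, {}).get(raw_label)
        if label is None:
            continue
        unified.append({"lang": language, "text": text, "label": label})
    return unified


def label_counts(unified):
    """Count examples per (language, label) pair to inspect class balance."""
    return Counter((row["lang"], row["label"]) for row in unified)


# Toy records, invented for illustration only.
records = [
    ("dataset_en", "en", "great match today", "2"),
    ("dataset_en", "en", "worst service ever", "0"),
    ("dataset_es", "es", "me encanta esta canción", "P"),
    ("dataset_es", "es", "no sé qué pensar", "NEU"),
    ("dataset_es", "es", "etiqueta desconocida", "NONE"),
]

unified = unify(records)
counts = label_counts(unified)
```

Dropping unmapped labels instead of guessing keeps the merged data consistent across languages, at the cost of losing some examples. The per-language counts give a quick check on class balance before any fine-tuning run.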

Code Implementations: 1 repo
