CL | Jan 28, 2021

Semi-automatic Generation of Multilingual Datasets for Stance Detection in Twitter

Elena Zotova, Rodrigo Agerri, German Rigau

arXiv:2101.11978v1 | 1.825 citations | Has Code

Originality: Incremental advance

AI Analysis

This paper addresses the cost and scarcity of annotated data for natural language processing research. The contribution is incremental, as it targets annotation challenges that are already well known.

The paper tackles the lack of multilingual datasets for stance detection in Twitter by proposing a semi-automatic method that leverages user-based information to label tweets, resulting in large, balanced corpora for monolingual and cross-lingual experimentation.

Popular social media networks provide the perfect environment to study the opinions and attitudes expressed by users. While interactions in social media such as Twitter occur in many natural languages, research on stance detection (the position or attitude expressed with respect to a specific topic) within the Natural Language Processing field has largely been done for English. Although some efforts have recently been made to develop annotated data in other languages, there is a telling lack of resources to facilitate multilingual and cross-lingual research on stance detection. This is partially due to the fact that manually annotating a corpus of social media texts is a difficult, slow and costly process. Furthermore, as stance is a highly domain- and topic-specific phenomenon, the need for annotated data is especially demanding. As a result, most of the manually labeled resources are hindered by their relatively small size and skewed class distribution. This paper presents a method to obtain multilingual datasets for stance detection in Twitter. Instead of manually annotating on a per-tweet basis, we leverage user-based information to semi-automatically label large amounts of tweets. Empirical monolingual and cross-lingual experimentation and qualitative analysis show that our method helps to overcome the aforementioned difficulties to build large, balanced and multilingual labeled corpora. We believe that our method can be easily adapted to generate labeled social media data for other Natural Language Processing tasks and domains.
