IR SI · Mar 31, 2020

A large-scale Twitter dataset for drug safety applications mined from publicly existing resources

arXiv:2003.13900v14.311 citations

Originality: Synthesis-oriented

AI Analysis

This work provides a large public dataset that pharmacovigilance researchers can use to identify adverse drug reactions, addressing a data bottleneck in the field.

The researchers address the shortage of large-scale social media data for drug safety applications by repurposing a publicly available archive of more than 9.4 billion tweets into a dataset of 1,181,993 drug usage-related tweets, validated using existing curated datasets and machine learning methods.

With the growing popularity of deep learning models for natural language processing (NLP) tasks, the field of pharmacovigilance, and more specifically the identification of adverse drug reactions (ADRs), has an inherent need for large-scale social media datasets aimed at such tasks. Most researchers spend large amounts of time crawling Twitter or buy expensive pre-curated datasets and then annotate them manually; these approaches do not scale well as more and more data keeps flowing into Twitter. In this work we repurpose a publicly available archived dataset of more than 9.4 billion tweets with the objective of creating a very large dataset of drug usage-related tweets. Using existing manually curated datasets from the literature, we then validate our filtered tweets for relevance using machine learning methods, with the end result of a publicly available dataset of 1,181,993 tweets. We provide all code, a detailed procedure for extracting this dataset, and the selected tweet IDs for researchers to use.
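The filtering step described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the drug-term list, the JSON-lines input format, and the `id` and `text` field names are all assumptions made for illustration, and the machine learning validation step is not covered.

```python
import json
import re
from typing import Dict, Iterable, Iterator, List

# Hypothetical term list; the dictionary used in the paper is not given here.
DRUG_TERMS: List[str] = ["ibuprofen", "metformin", "prozac"]


def build_pattern(terms: Iterable[str]) -> "re.Pattern[str]":
    """Compile one case-insensitive, whole-word regex for all terms."""
    # Longest terms first so multi-word names win over their prefixes.
    escaped = sorted((re.escape(t) for t in terms), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


def filter_drug_tweets(
    lines: Iterable[str], pattern: "re.Pattern[str]"
) -> Iterator[Dict[str, object]]:
    """Yield the id and first matched term of tweets that mention a drug.

    Assumes one JSON object per line with "id" and "text" fields;
    malformed or incomplete records are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(tweet, dict) or "id" not in tweet:
            continue
        match = pattern.search(str(tweet.get("text", "")))
        if match:
            yield {"id": tweet["id"], "term": match.group(0).lower()}


if __name__ == "__main__":
    sample = [
        '{"id": 1, "text": "Took Ibuprofen and my headache is gone"}',
        '{"id": 2, "text": "Nice weather today"}',
        '{"id": 3, "text": "metformin made me nauseous"}',
    ]
    for hit in filter_drug_tweets(sample, build_pattern(DRUG_TERMS)):
        print(hit)
```

Because the function is a generator over lines, it can be fed an open file handle, so an archive of this size would not need to fit in memory. Keeping only tweet IDs mirrors the abstract's note that the selected tweet IDs are what gets released.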
