CL, Apr 8

Curation and Extraction of Drug-Related Entities from Reddit Platform

Zewei Wang, Zihan Xu, Yishu Wei, Michael Chary, Yifan Peng

arXiv:2605.2644559.1

AI Analysis

This work provides a specialized dataset and benchmarks for extracting drug-related information from social media, which could help physicians understand real-world drug use, but the results are incremental.

The authors created ReDose, a dataset of 6,435 Reddit posts annotated with drug, dose, and effect entities, and benchmarked extraction models. BiomedBERT achieved an F1 of 0.843 for DRUG, while Llama-3 70B outperformed GPT-4 (F1=0.79 vs 0.72), but EFFECT extraction remains challenging (GPT-4 recall 0.41).

Physicians learn about illicit drugs primarily from clinical overdose cases, which limits their understanding of real-world usage. Meanwhile, drug users share first-hand experiences online, offering insights into the dosage and effects of drugs. To bridge this gap, we introduce ReDose (REddit Drug DOSe and Effect), a dataset of 6,435 Reddit posts on substance use. A board-certified toxicologist primarily annotated both the training and test sets, while two medical science students contributed to the test set, labeling DRUG, DOSE, and EFFECT entities. We benchmarked 6,267 annotations using BERT-based, large language model (LLM)-based, and Retrieval-Augmented Generation (RAG) models. BiomedBERT achieved an F1-score of 0.843 for DRUG, while Llama-3 70B outperformed GPT-4 (F1 = 0.79 vs. 0.72). EFFECT extraction remains challenging, with GPT-4 achieving a recall of 0.41. ReDose captures patient-curated narratives to advance medical data extraction from social media.
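The precision, recall, and F1 figures quoted above are standard entity-extraction metrics. As a minimal sketch of how such scores can be computed per entity type, the snippet below assumes exact-match scoring on character spans; the abstract does not state the matching criterion the authors used, and the `Entity` record, the `prf` helper, and the example post are all hypothetical illustrations rather than anything taken from ReDose.

```python
from collections import namedtuple

# Hypothetical entity record: a label plus character offsets [start, end) in one post.
Entity = namedtuple("Entity", ["label", "start", "end"])


def prf(gold, pred, label):
    """Return (precision, recall, F1) for one label under exact span match.

    Exact match is an assumption here; relaxed (overlap) matching would
    score partially correct spans differently.
    """
    gold_set = {e for e in gold if e.label == label}
    pred_set = {e for e in pred if e.label == label}
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up post: "took 20 mg of oxycodone and felt dizzy"
gold = [
    Entity("DOSE", 5, 10),     # "20 mg"
    Entity("DRUG", 14, 23),    # "oxycodone"
    Entity("EFFECT", 33, 38),  # "dizzy"
]
pred = [
    Entity("DOSE", 5, 10),
    Entity("DRUG", 14, 23),
    Entity("EFFECT", 28, 38),  # "felt dizzy": wrong boundary, so no exact match
]

for label in ("DRUG", "DOSE", "EFFECT"):
    print(label, prf(gold, pred, label))
```

Under exact matching, a boundary disagreement such as "felt dizzy" versus "dizzy" counts as both a false positive and a false negative, which is one reason free-text entity types like EFFECT tend to score lower than short, well-delimited ones like DRUG.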
