The Russian Drug Reaction Corpus and Neural Models for Drug Reactions and Effectiveness Detection in User Reviews
This work addresses the monitoring of drug safety and effectiveness from Russian-language user-generated content. It provides a new dataset and baseline models for the natural language processing community, although the contribution is incremental in that it adapts existing methods to a new language and domain.
The authors introduce the Russian Drug Reaction Corpus (RuDReC), which comprises 1.4 million raw health-related texts and 500 labelled Russian user reviews for detecting drug reactions and effectiveness. Their baseline model reaches a macro F1 score of 74.85% on named entity recognition and 68.82% on sentence classification, the latter a 7.47% gain over a BERT model trained on Russian data.
The Russian Drug Reaction Corpus (RuDReC) is a new, partially annotated corpus of Russian-language consumer reviews of pharmaceutical products, built for detecting health-related named entities and the effectiveness of pharmaceutical products. The corpus consists of two parts, a raw part and a labelled part. The raw part includes 1.4 million health-related user-generated texts collected from various Internet sources, including social media. The labelled part contains 500 consumer reviews about drug therapy with drug- and disease-related information. Sentence-level labels indicate the presence or absence of health-related issues. Sentences that contain such an issue are additionally labelled at the expression level to identify fine-grained subtypes such as drug classes, drug forms, drug indications, and drug reactions. Further, we present a baseline model for the named entity recognition (NER) and multi-label sentence classification tasks on this corpus. Our RuDR-BERT model achieves a macro F1 score of 74.85% on the NER task. On the sentence classification task, our model achieves a macro F1 score of 68.82%, a gain of 7.47% over the score of a BERT model trained on Russian data. We make the RuDReC corpus and the pretrained weights of the domain-specific BERT models freely available at https://github.com/cimm-kzn/RuDReC
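For reference, macro F1 is the unweighted mean of per-label F1 scores, so rare labels count as much as frequent ones. The following is a minimal Python sketch of that computation in a multi-label sentence classification setting. The function name `macro_f1`, the label names, and the toy data are hypothetical illustrations and are not taken from the RuDReC evaluation code; scoring a label with no gold and no predicted instances as 0 is likewise an assumed convention.

```python
from typing import List, Set


def macro_f1(gold: List[Set[str]], pred: List[Set[str]], labels: List[str]) -> float:
    """Return the unweighted mean of per-label F1 over multi-label predictions."""
    scores = []
    for label in labels:
        # Count true positives, false positives and false negatives for this label.
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        denom = 2 * tp + fp + fn
        # Assumed convention: a label that never occurs and is never predicted scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(labels)


# Hypothetical label names and toy data: one set of labels per sentence.
LABELS = ["reaction", "indication", "effectiveness"]
gold = [{"reaction"}, {"indication", "effectiveness"}, set(), {"reaction"}]
pred = [{"reaction"}, {"indication"}, {"reaction"}, set()]

score = macro_f1(gold, pred, LABELS)
print(f"macro F1 = {score:.2f}")
```

In this toy example the per-label F1 values are 0.5, 1.0 and 0.0, which shows how a single missed label pulls the macro average down regardless of how often that label occurs.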