A Novel Dataset Towards Extracting Virus-Host Interactions
This work addresses the need for automated extraction of host-pathogen detection methods from scientific literature, with potential applications in predicting viral spillover risk, a concern for human health.
The authors introduce a new manually annotated dataset for named-entity recognition (NER) focused on virus-host interactions and report initial results obtained with pre-trained models on this dataset.
We describe a novel dataset for the automated recognition of named taxonomic and other entities relevant to the association of viruses with their hosts. We also report initial results obtained by applying pre-trained models to the named-entity recognition (NER) task on this dataset. We propose that our manually annotated abstracts offer a Gold Standard Corpus for training future NER models to extract host-pathogen detection methods automatically from scientific publications. We also explain how this work takes first steps towards automatically predicting viral spillover risk, a concept important to human health, from the scientific literature.