CL IR LG · Jan 12, 2021

AI- and HPC-enabled Lead Generation for SARS-CoV-2: Models and Processes to Extract Druglike Molecules Contained in Natural Language Text

Zhi Hong, J. Gregory Pauloski, Logan Ward, Kyle Chard, Ben Blaiszik, Ian Foster

arXiv:2101.04617v1 · Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for efficient drug candidate discovery for COVID-19 by automating molecule extraction from scientific literature, though it is incremental as it builds on existing NLP methods.

The researchers tackled the problem of identifying drug-like molecules for SARS-CoV-2 by training a named entity recognition model on text labeled by non-expert humans. The trained model extracted 10,912 molecules from the 198,875 papers in the CORD-19 corpus, with performance comparable to that of non-expert humans.

Researchers worldwide are seeking to repurpose existing drugs or discover new drugs to counter the disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). A promising source of candidates for such studies is molecules that have been reported in the scientific literature to be drug-like in the context of coronavirus research. We report here on a project that leverages both human and artificial intelligence to detect references to drug-like molecules in free text. We engage non-expert humans to create a corpus of labeled text, use this labeled corpus to train a named entity recognition model, and employ the trained model to extract 10912 drug-like molecules from the COVID-19 Open Research Dataset Challenge (CORD-19) corpus of 198875 papers. Performance analyses show that our automated extraction model can achieve performance on par with that of non-expert humans.
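The extraction step described in the abstract can be illustrated with a minimal sketch. It assumes the trained named entity recognition model emits one BIO tag per token (the abstract does not specify the tagging scheme) and shows only how tagged tokens might be turned into a deduplicated list of molecule mentions. The label name `DRUG`, the function names, and the two example sentences are hypothetical and are not taken from the paper or its code.

```python
from collections import Counter


def extract_entities(tokens, tags, label="DRUG"):
    """Collect contiguous B-/I- spans for one label from BIO-tagged tokens."""
    entities, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == f"B-{label}":
            # A new entity starts; close any entity still open.
            if current:
                entities.append(" ".join(current))
            current = [token]
        elif tag == f"I-{label}" and current:
            # Continue the entity that is currently open.
            current.append(token)
        else:
            # Outside tag (or a stray I- tag): close any open entity.
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities


def count_molecules(tagged_docs, label="DRUG"):
    """Count case-insensitive molecule mentions across tagged documents."""
    counts = Counter()
    for tokens, tags in tagged_docs:
        for name in extract_entities(tokens, tags, label):
            counts[name.lower()] += 1
    return counts


# Hypothetical model output: (tokens, tags) pairs, one per sentence.
tagged_docs = [
    (["Remdesivir", "inhibits", "viral", "replication", "."],
     ["B-DRUG", "O", "O", "O", "O"]),
    (["We", "tested", "lopinavir", "and", "remdesivir", "."],
     ["O", "O", "B-DRUG", "O", "B-DRUG", "O"]),
]
molecule_counts = count_molecules(tagged_docs)
```

Lowercasing before counting is a deliberately simple form of deduplication; a real pipeline would likely need to normalize synonyms and spelling variants as well.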

