LG, CL | Nov 10, 2020

Biomedical Information Extraction for Disease Gene Prioritization

Jupinder Parmar, William Koehler, Martin Bringmann, Katharina Sophia Volz, Berk Kapicioglu

arXiv:2011.05188v23.32 citations

Originality: Incremental advance

AI Analysis

This work addresses the identification of disease-gene associations for drug target development. It is an incremental advance that augments existing structured data with text-based extractions.

The authors approach disease-gene prioritization by building a biomedical information extraction pipeline whose components, such as named entity recognition and relation extraction, outperform state-of-the-art methods in BioNLP. They apply it to PubMed abstracts to extract protein-protein interactions and show that adding these extractions to an existing knowledge graph yields a 20% relative increase in hit@30 when predicting novel disease-gene associations.

We introduce a biomedical information extraction (IE) pipeline that extracts biological relationships from text and demonstrate that its components, such as named entity recognition (NER) and relation extraction (RE), outperform state-of-the-art in BioNLP. We apply it to tens of millions of PubMed abstracts to extract protein-protein interactions (PPIs) and augment these extractions to a biomedical knowledge graph that already contains PPIs extracted from STRING, the leading structured PPI database. We show that, despite already containing PPIs from an established structured source, augmenting our own IE-based extractions to the graph allows us to predict novel disease-gene associations with a 20% relative increase in hit@30, an important step towards developing drug targets for uncured diseases.
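To make the reported metric concrete, here is a minimal sketch of how hit@k and a relative increase can be computed. It assumes one common definition of hit@k: the fraction of held-out (disease, gene) pairs whose gene appears among the top k ranked candidates for that disease. The paper's exact evaluation protocol is not given in the abstract, so the function names, the toy rankings, and the choice of k=2 below are all hypothetical and for illustration only.

```python
from typing import Dict, List, Tuple


def hit_at_k(
    ranked: Dict[str, List[str]],
    held_out: List[Tuple[str, str]],
    k: int = 30,
) -> float:
    """Fraction of held-out (disease, gene) pairs whose gene is ranked
    within the top k candidates for its disease (assumed definition)."""
    if not held_out:
        raise ValueError("held_out must not be empty")
    hits = sum(
        1 for disease, gene in held_out
        if gene in ranked.get(disease, [])[:k]
    )
    return hits / len(held_out)


def relative_increase(baseline: float, augmented: float) -> float:
    """Relative change of `augmented` over `baseline`, e.g. 0.2 means +20%."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (augmented - baseline) / baseline


# Toy data (hypothetical, not from the paper): gene rankings per disease
# produced by a baseline graph and by a graph with extra extracted edges.
held_out = [("d1", "g2"), ("d1", "g3"), ("d2", "g6"), ("d2", "g4")]
baseline_ranked = {"d1": ["g1", "g2", "g3"], "d2": ["g4", "g5", "g6"]}
augmented_ranked = {"d1": ["g3", "g2", "g1"], "d2": ["g4", "g5", "g6"]}

baseline_score = hit_at_k(baseline_ranked, held_out, k=2)
augmented_score = hit_at_k(augmented_ranked, held_out, k=2)
gain = relative_increase(baseline_score, augmented_score)
```

Note that a relative increase is measured against the baseline score, so a 20% relative gain in hit@30 is not the same as a gain of 20 percentage points.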

