LG, CL | Nov 10, 2020

Biomedical Information Extraction for Disease Gene Prioritization

arXiv:2011.05188v2 | 2 citations
AI Analysis

This work addresses the problem of identifying disease-gene associations to support drug target development. Its contribution is incremental: it augments existing structured data with relations extracted from text.

The authors approach disease-gene prioritization by building a biomedical information extraction pipeline whose components (named entity recognition and relation extraction) outperform state-of-the-art BioNLP methods. They apply it to PubMed abstracts to extract protein-protein interactions. Adding these extractions to an existing knowledge graph yields a 20% relative increase in hit@30 when predicting novel disease-gene associations.

We introduce a biomedical information extraction (IE) pipeline that extracts biological relationships from text and demonstrate that its components, such as named entity recognition (NER) and relation extraction (RE), outperform the state of the art in BioNLP. We apply it to tens of millions of PubMed abstracts to extract protein-protein interactions (PPIs) and add these extractions to a biomedical knowledge graph that already contains PPIs extracted from STRING, the leading structured PPI database. We show that, even though the graph already contains PPIs from an established structured source, adding our own IE-based extractions allows us to predict novel disease-gene associations with a 20% relative increase in hit@30, an important step towards developing drug targets for uncured diseases.
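The headline result is a relative change in hit@30, so it may help to see how such a number could be computed. The sketch below is not the authors' evaluation code. It assumes a common definition of hit@k (the fraction of diseases for which at least one true gene appears among the top k ranked candidates), which the abstract does not spell out. The function names, the toy disease and gene identifiers, and the example scores are all hypothetical.

```python
from typing import Dict, List, Set


def hit_at_k(ranked: Dict[str, List[str]], truth: Dict[str, Set[str]], k: int = 30) -> float:
    """Fraction of diseases with at least one true gene among the top-k ranked genes.

    Assumed definition; the paper may compute hit@k differently.
    """
    if not ranked:
        return 0.0
    hits = sum(
        1
        for disease, genes in ranked.items()
        if truth.get(disease, set()) & set(genes[:k])
    )
    return hits / len(ranked)


def relative_increase(baseline: float, augmented: float) -> float:
    """Relative (not absolute) change of the augmented score over the baseline."""
    return (augmented - baseline) / baseline


# Hypothetical toy data: two diseases, each with a ranked list of candidate genes.
ranked_genes = {"disease_a": ["g1", "g2"], "disease_b": ["g3", "g4"]}
true_genes = {"disease_a": {"g2"}, "disease_b": {"g9"}}

score = hit_at_k(ranked_genes, true_genes, k=2)

# Hypothetical scores chosen only to illustrate what a 20% relative gain means:
# moving from 0.25 to 0.30 is a 5-point absolute gain but a 20% relative gain.
gain = relative_increase(0.25, 0.30)
```

The distinction between relative and absolute change matters when reading the result: a 20% relative increase says nothing about the absolute hit@30 values, which the abstract does not report.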

