Predicting Disease-Gene Associations using Cross-Document Graph-based Features
This work addresses the need for more accurate text mining methods in personalized medicine that can complement structured databases. Its contribution is incremental, since it builds on existing graph-based and machine learning techniques.
The paper tackles the problem of identifying disease-gene associations from text. It addresses limitations of simple co-occurrence methods, such as spurious links and the inability to aggregate knowledge spread across documents, and reports a 30-point increase in F1 score over a plain co-occurrence baseline on the manually curated Genetic Testing Registry.
In the context of personalized medicine, text mining methods are an attractive option for identifying disease-gene associations, as they can generate novel links between diseases and genes that complement knowledge from structured databases. The most straightforward way to extract such links from text is to assume an association between every gene and disease that co-occur within the same document. However, this approach (i) tends to yield a number of spurious associations, (ii) does not capture different relevant types of associations, and (iii) is incapable of aggregating knowledge that is spread across documents. We therefore propose an approach in which disease-gene co-occurrences and gene-gene interactions are represented in an RDF graph. A machine learning classifier is then trained on features extracted from the graph to separate disease-gene pairs into valid associations and spurious ones. On the manually curated Genetic Testing Registry, our approach yields a 30-point increase in F1 score over a plain co-occurrence baseline.
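To make the idea of cross-document graph features concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the graph can be approximated by plain (subject, predicate, object, document) tuples. The predicate names (`ex:coOccursWith`, `ex:interactsWith`), the toy data, and the three features are all hypothetical illustrations; the abstract does not specify the RDF vocabulary, the feature set, or the classifier.

```python
from collections import defaultdict

# Hypothetical predicate names; the source does not specify its RDF vocabulary.
CO_OCCURS = "ex:coOccursWith"   # disease -> gene, one statement per document
INTERACTS = "ex:interactsWith"  # gene -> gene

# Invented toy statements: (subject, predicate, object, source document).
TRIPLES = [
    ("disease:D1", CO_OCCURS, "gene:G1", "doc1"),
    ("disease:D1", CO_OCCURS, "gene:G1", "doc2"),
    ("disease:D1", CO_OCCURS, "gene:G2", "doc3"),
    ("disease:D2", CO_OCCURS, "gene:G3", "doc4"),
    ("gene:G1", INTERACTS, "gene:G2", "doc5"),
    ("gene:G3", INTERACTS, "gene:G1", "doc6"),
]


def build_indexes(triples):
    """Index the statements for fast feature lookup."""
    docs = defaultdict(set)      # (disease, gene) -> documents mentioning both
    partners = defaultdict(set)  # gene -> genes it interacts with (undirected)
    for subj, pred, obj, doc in triples:
        if pred == CO_OCCURS:
            docs[(subj, obj)].add(doc)
        elif pred == INTERACTS:
            partners[subj].add(obj)
            partners[obj].add(subj)
    return docs, partners


def pair_features(disease, gene, docs, partners):
    """Illustrative features for one candidate disease-gene pair."""
    genes_of_disease = {g for (d, g) in docs if d == disease}
    neighbours = partners.get(gene, set())
    return {
        # Direct evidence: number of documents in which the pair co-occurs.
        "cooccurrence_docs": len(docs.get((disease, gene), ())),
        # Connectivity of the gene in the interaction part of the graph.
        "interaction_partners": len(neighbours),
        # Cross-document evidence: interaction partners of the gene that
        # themselves co-occur with the disease somewhere in the corpus.
        "partners_linked_to_disease": len(neighbours & genes_of_disease),
    }


DOCS, PARTNERS = build_indexes(TRIPLES)
for pair in [("disease:D1", "gene:G1"), ("disease:D2", "gene:G1")]:
    print(pair, pair_features(*pair, DOCS, PARTNERS))
```

In this toy data, `disease:D2` and `gene:G1` never appear in the same document, yet the pair still receives a non-zero `partners_linked_to_disease` value because `gene:G1` interacts with `gene:G3`, which co-occurs with `disease:D2`. A plain co-occurrence baseline cannot see this kind of indirect evidence. In a full pipeline, such feature dictionaries would be vectorized and passed to a classifier trained on curated labels.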