CL IR LG NE
Feb 12, 2015

Applying deep learning techniques on medical corpora from the World Wide Web: a prototypical system and evaluation

Jose Antonio Miñarro-Giménez, Oscar Marín-Alonso, Matthias Samwald

arXiv:1502.03682v13.129 citations

Originality Synthesis-oriented

AI Analysis

This work addresses the challenge of keeping biomedical knowledge bases up-to-date, but it is incremental as it tests an existing method on new data with modest results.

The study applied word2vec to medical corpora to identify pharmaceutical relationships from unstructured text, achieving a maximum accuracy of 49.28% compared to a manually curated ontology, indicating limited effectiveness for automatic knowledge base population.

BACKGROUND: The amount of biomedical literature is growing rapidly, and it is becoming increasingly difficult to keep manually curated knowledge bases and ontologies up-to-date. In this study we applied the word2vec deep learning toolkit to medical corpora to test its potential for identifying relationships from unstructured text. We evaluated the efficiency of word2vec in identifying properties of pharmaceuticals based on mid-sized, unstructured medical text corpora available on the web. Properties included relationships to diseases ('may treat') or physiological processes ('has physiological effect'). We compared the relationships identified by word2vec with manually curated information from the National Drug File - Reference Terminology (NDF-RT) ontology as a gold standard.

RESULTS: Our results revealed a maximum accuracy of 49.28%, which suggests a limited ability of word2vec to capture linguistic regularities on the collected medical corpora compared with other published results. We were able to document the influence of different parameter settings on result accuracy and found an unexpected trade-off between ranking quality and accuracy. Pre-processing corpora to reduce syntactic variability proved to be a good strategy for increasing the utility of the trained vector models.

CONCLUSIONS: Word2vec is a very efficient implementation for computing vector representations, and it can identify relationships in textual data without any prior domain knowledge. We found that the ranking and retrieved results generated by word2vec were not of sufficient quality for automatic population of knowledge bases and ontologies, but could serve as a starting point for further manual curation.
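To make the evaluation idea concrete, the following is a minimal sketch of how ranked candidates from a vector model might be scored against a gold standard. It is not the authors' protocol: the three-dimensional vectors, the gold pairs, and the names `top_k_neighbors` and `top_k_accuracy` are all hypothetical, and plain cosine similarity stands in for whatever query method the study used. A drug counts as a hit here if at least one gold-standard term appears among its k nearest neighbours.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_neighbors(term, vectors, k):
    """Return the k terms most similar to `term`, best first."""
    query = vectors[term]
    scored = [
        (cosine(query, vec), other)
        for other, vec in vectors.items()
        if other != term
    ]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]


def top_k_accuracy(gold, vectors, k):
    """Fraction of gold-standard drugs with at least one expected term in the top k."""
    hits = 0
    for drug, expected in gold.items():
        if expected & set(top_k_neighbors(drug, vectors, k)):
            hits += 1
    return hits / len(gold)


# Toy stand-ins (assumptions): hand-made vectors instead of a trained model,
# and two illustrative 'may treat' pairs instead of NDF-RT.
TOY_VECTORS = {
    "aspirin": [0.9, 0.1, 0.0],
    "headache": [0.8, 0.2, 0.1],
    "insulin": [0.1, 0.9, 0.0],
    "diabetes": [0.2, 0.8, 0.1],
    "fracture": [0.0, 0.1, 0.9],
}
TOY_GOLD = {"aspirin": {"headache"}, "insulin": {"diabetes"}}

if __name__ == "__main__":
    for k in (1, 3):
        print(k, top_k_accuracy(TOY_GOLD, TOY_VECTORS, k))
```

Raising k makes a hit more likely but says less about where the correct term sits in the ranking, which is one simple way to see how accuracy and ranking quality can pull in different directions.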
